[jira] [Commented] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016996#comment-16016996 ] Xiaochen Ouyang commented on SPARK-19569: - This really is a problem and we should reopen this issue, because we can reproduce it programmatically, as follows (the original snippet omitted the construction of cArgs; it is presumably built from arg0): val conf = new SparkConf() conf.set("spark.app.name", "SparkOnYarnClient") conf.setMaster("yarn-client") conf.set("spark.driver.host", "192.168.10.128") val arg0 = new ArrayBuffer[String]() arg0 += "--jar" arg0 += args(0) arg0 += "--class" arg0 += "com.hello.SparkPI" val cArgs = new ClientArguments(arg0.toArray) val client = new Client(cArgs, hadoopConf, conf) client.submitApplication() However, submission succeeds when we use the spark-submit shell to submit a job in yarn-client mode. > could not get APP ID and cause failed to connect to spark driver on > yarn-client mode > - > > Key: SPARK-19569 > URL: https://issues.apache.org/jira/browse/SPARK-19569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: hadoop2.7.1 > spark2.0.2 > hive2.2 >Reporter: KaiXu > > when I run Hive queries on Spark, got below error in the console, after check > the container's log, found it failed to connected to spark driver. I have set > hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been > submitted after 3601s', actually during this long-time period it's impossible > no available resource, and also did not see any issue related to the network, > so the cause is not clear from the message "Possible reasons include network > issues, errors in remote driver or the cluster has no available resources, > etc.". > From Hive's log, failed to get APP ID, so this might be the cause why the > driver did not start up. > console log: > Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00 > Job hasn't been submitted after 3601s. Aborting it. > Possible reasons include network issues, errors in remote driver or the > cluster has no available resources, etc. > Please check YARN or Spark driver's logs for further information. 
> Status: SENT > FAILED: Execution Error, return code 2 from > org.apache.hadoop.hive.ql.exec.spark.SparkTask > container's log: > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources > Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 > file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip" > } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: > PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: > 8020 file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" > } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE) > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: > appattempt_1486905599813_0046_02 > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be > reachable. > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Fai
[jira] [Comment Edited] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016996#comment-16016996 ] Xiaochen Ouyang edited comment on SPARK-19569 at 5/19/17 7:04 AM: -- This really is a problem and we should reopen this issue, because we can reproduce it programmatically, as follows (the original snippet omitted the construction of cArgs; it is presumably built from arg0): val conf = new SparkConf() conf.set("spark.app.name", "SparkOnYarnClient") conf.setMaster("yarn-client") conf.set("spark.driver.host", "192.168.10.128") val arg0 = new ArrayBuffer[String]() arg0 += "--jar" arg0 += args(0) arg0 += "--class" arg0 += "com.hello.SparkPI" val cArgs = new ClientArguments(arg0.toArray) val client = new Client(cArgs, hadoopConf, conf) client.submitApplication() However, submission succeeds when we use the spark-submit shell to submit a job in yarn-client mode. [~srowen] was (Author: ouyangxc.zte): It is really a problem, we should reopen this issue. Because we can reproduce this problem by programing way. as follow: val conf = new SparkConf() conf.set("spark.app.name", "SparkOnYarnClient") conf.setMaster("yarn-client") conf.set("spark.driver.host","192.168.10.128") val arg0 = new ArrayBuffer[String]() arg0 += "--jar" arg0 += args(0) arg0 += "--class" arg0 += "com.hello.SparkPI" val client = new Client(cArgs, hadoopConf, conf) client.submitApplication() But, it will be successfully when we using spark-submit shell to submit a job whih yarn-client mode. > could not get APP ID and cause failed to connect to spark driver on > yarn-client mode > - > > Key: SPARK-19569 > URL: https://issues.apache.org/jira/browse/SPARK-19569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: hadoop2.7.1 > spark2.0.2 > hive2.2 >Reporter: KaiXu > > when I run Hive queries on Spark, got below error in the console, after check > the container's log, found it failed to connected to spark driver. I have set > hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been > submitted after 3601s', actually during this long-time period it's impossible > no available resource, and also did not see any issue related to the network, > so the cause is not clear from the message "Possible reasons include network > issues, errors in remote driver or the cluster has no available resources, > etc.". > From Hive's log, failed to get APP ID, so this might be the cause why the > driver did not start up. > console log: > Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00 > Job hasn't been submitted after 3601s. Aborting it. > Possible reasons include network issues, errors in remote driver or the > cluster has no available resources, etc. > Please check YARN or Spark driver's logs for further information. 
> Status: SENT > FAILED: Execution Error, return code 2 from > org.apache.hadoop.hive.ql.exec.spark.SparkTask > container's log: > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources > Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 > file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip" > } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: > PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: > 8020 file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" > } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE) > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: > appattempt_1486905599813_0046_02 > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be > reachable. > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.1
[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x
[ https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017001#comment-16017001 ] Morten Hornbech commented on SPARK-17875: - We were just hit by a runtime error caused by this apparently obsolete dependency. More specifically, the version of SslHandler used by Netty 3.8 is not binary compatible with the one we use (and the one spark-core uses in Netty 4.0). We can get around this by shading our own dependency, but I think it's a bit nasty having this floating around risking unnecessary runtime errors - dependency management is difficult enough as it is :-) Could we reopen the issue? > Remove unneeded direct dependence on Netty 3.x > -- > > Key: SPARK-17875 > URL: https://issues.apache.org/jira/browse/SPARK-17875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Trivial > > The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is > used. It's best to remove the 3.x dependency (and while we're at it, update a > few things like license info) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
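For reference, the shading workaround mentioned above can be expressed with sbt-assembly's shade rules (a sketch assuming an sbt build using the sbt-assembly plugin; the shaded package prefix is arbitrary):

{code}
// build.sbt: relocate the application's own Netty 3.x classes so they can no
// longer collide with the netty 3.x artifact spark-core pulls in transitively.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.jboss.netty.**" -> "myshaded.netty.@1").inAll
)
{code}

Only the application's copy of Netty 3.x (package org.jboss.netty) is renamed; Netty 4.x lives under io.netty and is unaffected.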
[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x
[ https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017006#comment-16017006 ] Sean Owen commented on SPARK-17875: --- Did you see my pull request? > Remove unneeded direct dependence on Netty 3.x > -- > > Key: SPARK-17875 > URL: https://issues.apache.org/jira/browse/SPARK-17875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Trivial > > The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is > used. It's best to remove the 3.x dependency (and while we're at it, update a > few things like license info) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20806) Launcher:redundant code,invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041 ] Phoenix_Dad commented on SPARK-20806: - The "libdir.isDirectory()" expression is always true inside the "if" branch, because libdir has just been assigned the very directory the condition verified: if (new File(sparkHome, "jars").isDirectory()) { libdir = new File(sparkHome, "jars"); checkState(!failIfNotFound || libdir.isDirectory(), "Library directory '%s' does not exist.", libdir.getAbsolutePath()); } > Launcher:redundant code,invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
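In other words, the checkState in the "if" branch can never fire; only the "else" branch can detect a missing directory. A sketch of a simplified findJarsDir that probes each candidate directory once (the fallback path below is taken from the surrounding 2.1 source; this is not necessarily what an eventual patch will look like):

{code}
// Sketch: try "jars" first, fall back to the assembly build output, and only
// reach checkState when the fallback directory is missing as well.
static String findJarsDir(String sparkHome, String scalaVersion, boolean failIfNotFound) {
  File libdir = new File(sparkHome, "jars");
  if (!libdir.isDirectory()) {
    libdir = new File(sparkHome,
      String.format("assembly/target/scala-%s/jars", scalaVersion));
    if (!libdir.isDirectory()) {
      checkState(!failIfNotFound,
        "Library directory '%s' does not exist; make sure Spark is built.",
        libdir.getAbsolutePath());
      return null;
    }
  }
  return libdir.getAbsolutePath();
}
{code}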
[jira] [Reopened] (SPARK-20806) Launcher:redundant code,invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-20806: --- OK, I get it. That should be in the description. > Launcher:redundant code,invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20806) Launcher: redundant check for Spark lib dir
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20806: -- Summary: Launcher: redundant check for Spark lib dir (was: Launcher:redundant code,invalid branch of judgment) > Launcher: redundant check for Spark lib dir > --- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad >Priority: Trivial > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20806) Launcher:redundant code,invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20806: -- Priority: Trivial (was: Major) Issue Type: Improvement (was: Bug) > Launcher:redundant code,invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad >Priority: Trivial > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20806) Launcher:redundant code,invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041 ] Phoenix_Dad edited comment on SPARK-20806 at 5/19/17 8:08 AM: -- [~srowen] The "libdir.isDirectory()" expression is always true inside the "if" branch, because libdir has just been assigned the very directory the condition verified: if (new File(sparkHome, "jars").isDirectory()) { libdir = new File(sparkHome, "jars"); checkState(!failIfNotFound || libdir.isDirectory(), "Library directory '%s' does not exist.", libdir.getAbsolutePath()); } was (Author: phoenix_dad): the " libdir.isDirectory()" express is always true within the "if" if (new File(sparkHome, "jars").isDirectory()) { libdir = new File(sparkHome, "jars"); checkState(!failIfNotFound || libdir.isDirectory(), "Library directory '%s' does not exist.", libdir.getAbsolutePath()); } > Launcher:redundant code,invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20806) Launcher:redundant code,invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017041#comment-16017041 ] Phoenix_Dad edited comment on SPARK-20806 at 5/19/17 8:09 AM: -- [~srowen] The "libdir.isDirectory()" expression is always true inside the "if" branch, because libdir has just been assigned the very directory the condition verified: if (new File(sparkHome, "jars").isDirectory()) { libdir = new File(sparkHome, "jars"); checkState(!failIfNotFound || libdir.isDirectory(), "Library directory '%s' does not exist.", libdir.getAbsolutePath()); } was (Author: phoenix_dad): [~srowen] the " libdir.isDirectory()" express is always true within the "if" if (new File(sparkHome, "jars").isDirectory()) { libdir = new File(sparkHome, "jars"); checkState(!failIfNotFound || libdir.isDirectory(), "Library directory '%s' does not exist.", libdir.getAbsolutePath()); } > Launcher:redundant code,invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20806) Launcher: redundant check for Spark lib dir
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017049#comment-16017049 ] Apache Spark commented on SPARK-20806: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/18032 > Launcher: redundant check for Spark lib dir > --- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad >Priority: Trivial > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20806) Launcher: redundant check for Spark lib dir
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20806: Assignee: (was: Apache Spark) > Launcher: redundant check for Spark lib dir > --- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad >Priority: Trivial > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20806) Launcher: redundant check for Spark lib dir
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20806: Assignee: Apache Spark > Launcher: redundant check for Spark lib dir > --- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Submit >Affects Versions: 2.1.1 >Reporter: Phoenix_Dad >Assignee: Apache Spark >Priority: Trivial > > org.apache.spark.launcher.CommandBuilderUtils > In findJarsDir function, there is an “if or else” branch . > the first input argument of 'checkState' in 'if' subclause is always true, > so 'checkState' is useless here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x
[ https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017059#comment-16017059 ] Morten Hornbech commented on SPARK-17875: - Sorry, I have now. If the class files are indeed in the flume assembly my best guess is that this occurs because of binary compatibility issues between 4.0 and 3.8 triggered by static members upon load of ChannelPipelineFactory. I can see that ChannelPipelineFactory does not exist in 4.0 but it references ChannelPipeline in its class definition which does. So if that was loaded from 4.0 things could go wrong. If an upgrade of flume to netty 4.0 is a major task a simpler solution would be to shade netty 3.8 in the flume assembly. That way you should be able to get rid of it in spark-core. > Remove unneeded direct dependence on Netty 3.x > -- > > Key: SPARK-17875 > URL: https://issues.apache.org/jira/browse/SPARK-17875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Trivial > > The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is > used. It's best to remove the 3.x dependency (and while we're at it, update a > few things like license info) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x
[ https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-17875: - Assignee: (was: Sean Owen) Target Version/s: 3.0.0 At least, we can fix this in Spark 3, when we likely remove the flume integration or something. It's already a dependency liability and not sure how supported it is. > Remove unneeded direct dependence on Netty 3.x > -- > > Key: SPARK-17875 > URL: https://issues.apache.org/jira/browse/SPARK-17875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Sean Owen >Priority: Trivial > > The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is > used. It's best to remove the 3.x dependency (and while we're at it, update a > few things like license info) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19569) could not get APP ID and cause failed to connect to spark driver on yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-19569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017066#comment-16017066 ] Saisai Shao commented on SPARK-19569: - [~ouyangxc.zte] In your code above you directly call {{client.submitApplication()}} to launch the Spark application; I assume this client is {{org.apache.spark.deploy.yarn.Client}}. From my understanding it is not supported to call this class directly. Also, if you use yarn#client directly to launch a Spark on YARN application, I suspect you will have to do a lot of the preparation work that SparkSubmit normally does. > could not get APP ID and cause failed to connect to spark driver on > yarn-client mode > - > > Key: SPARK-19569 > URL: https://issues.apache.org/jira/browse/SPARK-19569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: hadoop2.7.1 > spark2.0.2 > hive2.2 >Reporter: KaiXu > > when I run Hive queries on Spark, got below error in the console, after check > the container's log, found it failed to connected to spark driver. I have set > hive.spark.job.monitor.timeout=3600s, so the log said 'Job hasn't been > submitted after 3601s', actually during this long-time period it's impossible > no available resource, and also did not see any issue related to the network, > so the cause is not clear from the message "Possible reasons include network > issues, errors in remote driver or the cluster has no available resources, > etc.". > From Hive's log, failed to get APP ID, so this might be the cause why the > driver did not start up. > console log: > Starting Spark Job = e9ce42c8-ff20-4ac8-803f-7668678c2a00 > Job hasn't been submitted after 3601s. Aborting it. > Possible reasons include network issues, errors in remote driver or the > cluster has no available resources, etc. > Please check YARN or Spark driver's logs for further information. > Status: SENT > FAILED: Execution Error, return code 2 from > org.apache.hadoop.hive.ql.exec.spark.SparkTask > container's log: > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Preparing Local resources > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Prepared Local resources > Map(__spark_libs__ -> resource { scheme: "hdfs" host: "hsx-node1" port: 8020 > file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_libs__6842484649003444330.zip" > } size: 153484072 timestamp: 1486926551130 type: ARCHIVE visibility: > PRIVATE, __spark_conf__ -> resource { scheme: "hdfs" host: "hsx-node1" port: > 8020 file: > "/user/root/.sparkStaging/application_1486905599813_0046/__spark_conf__.zip" > } size: 116245 timestamp: 1486926551318 type: ARCHIVE visibility: PRIVATE) > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: ApplicationAttemptId: > appattempt_1486905599813_0046_02 > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls to: root > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing view acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: Changing modify acls groups to: > 17/02/13 05:05:54 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/02/13 05:05:54 INFO yarn.ApplicationMaster: Waiting for Spark driver to be > reachable. 
> 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:54 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:43656, retrying ... > 17/02/13 05:05:55 ERROR yarn.ApplicationMaster: Failed to connect to driver > at 192.168.1.1:4365
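For reference, the supported way to launch a yarn-client application programmatically, rather than instantiating {{org.apache.spark.deploy.yarn.Client}} directly, is {{org.apache.spark.launcher.SparkLauncher}}, which goes through the same code path as the spark-submit script. A minimal sketch, reusing the jar argument, class name, and driver host from the reproduction earlier in this thread:

{code}
import org.apache.spark.launcher.SparkLauncher

object LaunchPi {
  def main(args: Array[String]): Unit = {
    // Launches a child spark-submit process, so all the driver-side
    // preparation normally done by SparkSubmit still happens.
    val handle = new SparkLauncher()
      .setAppResource(args(0))                 // path to the application jar
      .setMainClass("com.hello.SparkPI")       // class name from the repro above
      .setMaster("yarn")
      .setDeployMode("client")
      .setConf("spark.driver.host", "192.168.10.128")
      .startApplication()

    // Poll the handle until the application reaches a terminal state.
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
  }
}
{code}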
[jira] [Created] (SPARK-20807) Add compression/decompression of data to ColumnVector
Kazuaki Ishizaki created SPARK-20807: Summary: Add compression/decompression of data to ColumnVector Key: SPARK-20807 URL: https://issues.apache.org/jira/browse/SPARK-20807 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki While the current {{CachedBatch}} can compress data by using one of multiple compression schemes, {{ColumnVector}} cannot compress data. This is required for the table cache. This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20808) External Table unnecessarily not create in Hive-compatible way
Joachim Hereth created SPARK-20808: -- Summary: External Table unnecessarily not create in Hive-compatible way Key: SPARK-20808 URL: https://issues.apache.org/jira/browse/SPARK-20808 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: Joachim Hereth Priority: Minor In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} creates tables unnecessarily in a hive-incompatible way. For instance executing in a spark shell {code} val database = "default" val table = "table_name" val path = "/user/daki/" + database + "/" + table var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54)) val df = sc.parallelize(data).toDF("name","age") df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path) spark.sql("DROP TABLE IF EXISTS " + database + "." + table) spark.catalog.createExternalTable(database + "."+ table, path) {code} issues the warning {code} Search Subject for Kerberos V5 INIT cred (<>, sun.security.jgss.krb5.Krb5InitCredential) 17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist `default`.`table_name` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User daki does not have privileges for CREATETABLE) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720) ... {code} The Exception (user does not have privileges for CREATETABLE) is misleading (I do have the CREATE TABLE privilege). Querying the table with Hive does not return any result. With Spark one can access the data. The following code creates the table correctly (workaround): {code} def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, table: String, path: String) : String = { val rows = (for(col <- df.schema) yield "`" + col.name + "` " + col.dataType.simpleString).mkString(",\n") val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " + "STORED AS PARQUET " + "Location 'hdfs://nameservice1%s'").format(database, table, rows, path) return sqlStmnt } spark.sql("DROP TABLE IF EXISTS " + database + "." + table) spark.sql(sqlStatement(df, database, table, path)) {code} The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with Sentry enabled (in case this matters regarding the privilege warning). Spark was built against the CDH libraries. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20808) External Table unnecessarily not create in Hive-compatible way
[ https://issues.apache.org/jira/browse/SPARK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017108#comment-16017108 ] Joachim Hereth commented on SPARK-20808: - The warning is caused by an Exception raised by a call to [saveTableIntoHive() | https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L369]. I was not able to debug what caused the misleading Exception about privileges. > External Table unnecessarily not create in Hive-compatible way > -- > > Key: SPARK-20808 > URL: https://issues.apache.org/jira/browse/SPARK-20808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Joachim Hereth >Priority: Minor > > In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} creates tables > unnecessarily in a hive-incompatible way. > For instance executing in a spark shell > {code} > val database = "default" > val table = "table_name" > val path = "/user/daki/" + database + "/" + table > var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54)) > val df = sc.parallelize(data).toDF("name","age") > df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path) > spark.sql("DROP TABLE IF EXISTS " + database + "." + table) > spark.catalog.createExternalTable(database + "."+ table, path) > {code} > issues the warning > {code} > Search Subject for Kerberos V5 INIT cred (<>, > sun.security.jgss.krb5.Krb5InitCredential) > 17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist > `default`.`table_name` in a Hive compatible way. Persisting it into Hive > metastore in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User > daki does not have privileges for CREATETABLE) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720) > ... > {code} > The Exception (user does not have privileges for CREATETABLE) is misleading > (I do have the CREATE TABLE privilege). > Querying the table with Hive does not return any result. With Spark one can > access the data. > The following code creates the table correctly (workaround): > {code} > def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, > table: String, path: String) : String = { > val rows = (for(col <- df.schema) > yield "`" + col.name + "` " + > col.dataType.simpleString).mkString(",\n") > val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " + > "STORED AS PARQUET " + > "Location 'hdfs://nameservice1%s'").format(database, table, rows, path) > return sqlStmnt > } > spark.sql("DROP TABLE IF EXISTS " + database + "." + table) > spark.sql(sqlStatement(df, database, table, path)) > {code} > The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with > Sentry enabled (in case this matters regarding the privilege warning). Spark > was built against the CDH libraries. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine PRANG updated SPARK-18838: -- Attachment: SparkListernerComputeTime.xlsx execution trace > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: SparkListernerComputeTime.xlsx > > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
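The description above proposes one dedicated single-threaded executor with a bounded queue per listener. A minimal sketch of that idea (illustrative names only, not Spark's actual LiveListenerBus code):

{code}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// One bounded, single-threaded executor per listener: a slow listener's
// events back up (and are eventually dropped) in its own queue without
// delaying any other listener or the posting thread.
class PerListenerDispatcher[E](listeners: Seq[E => Unit], queueCapacity: Int = 10000) {
  private val executors = listeners.map { _ =>
    new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy())  // drop events once the queue is full
  }

  // The single thread per executor guarantees in-order processing per listener.
  def post(event: E): Unit =
    listeners.zip(executors).foreach { case (listener, executor) =>
      executor.execute(new Runnable { override def run(): Unit = listener(event) })
    }

  def stop(): Unit = executors.foreach(_.shutdown())
}
{code}

The DiscardPolicy makes the trade-off from the description explicit: under sustained overload a listener loses events rather than stalling the bus, and the bounded queues cap the extra driver memory each listener can consume.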
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017120#comment-16017120 ] Antoine PRANG commented on SPARK-18838: --- [~joshrosen] I uploaded the timings I get; I put some counters in the code, and you can take a look at the metrics branch of my fork. I do not have exact profiles of the methods. First, the StorageListener really does process a lot of messages: if I understand correctly, it has no no-op handler for the most frequent messages (SparkListenerBlockUpdated), which are not logged by the EventLoggingListener, for example. The StorageStatusListener listens to this kind of event too, yet its execution time is not comparable; the StorageListener seems to do much more work (with its parent classes), and there is a lot of synchronization that could be avoided, in my view. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: SparkListernerComputeTime.xlsx > > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits
James Porritt created SPARK-20809: - Summary: PySpark: Java heap space issue despite apparently being within memory limits Key: SPARK-20809 URL: https://issues.apache.org/jira/browse/SPARK-20809 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.1 Environment: Linux x86_64 Reporter: James Porritt I have the following script: {code} import itertools import loremipsum from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession conf = SparkConf().set("spark.cores.max", "16") \ .set("spark.driver.memory", "16g") \ .set("spark.executor.memory", "16g") \ .set("spark.executor.memory_overhead", "16g") \ .set("spark.driver.maxResultsSize", "0") sc = SparkContext(appName="testRDD", conf=conf) ss = SparkSession(sc) j = itertools.cycle(range(8)) rows = [(i, j.next(), ' '.join(map(lambda x: x[2], loremipsum.generate_sentences(600)))) for i in range(500)] * 100 rrd = sc.parallelize(rows, 128) {code} When I run it with: {noformat} /spark-2.1.1-bin-hadoop2.7/bin/spark-submit /writeTest.py {noformat} it fails with a 'Java heap space' error: {noformat} py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) {noformat} The data I create here approximates my actual data. The third element of each tuple should be around 25k, and there are 50k tuples overall. I estimate that I should have around 1.2G of data. Why then does it fail? All parts of the system should have enough memory? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
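One general Spark behavior worth ruling out here (an observation, not a confirmed diagnosis of this report): in client mode, setting spark.driver.memory in SparkConf inside the script has no effect, because the driver JVM is already running by the time that line executes, so the driver may still be at its default heap size. The setting has to be supplied before the JVM starts, for example:

{noformat}
/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --driver-memory 16g /writeTest.py
{noformat}

or via spark.driver.memory in conf/spark-defaults.conf.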
[jira] [Assigned] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20797: Assignee: Apache Spark > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 >Assignee: Apache Spark > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017127#comment-16017127 ] Apache Spark commented on SPARK-20797: -- User 'd0evi1' has created a pull request for this issue: https://github.com/apache/spark/pull/18034 > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20797: Assignee: (was: Apache Spark) > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017128#comment-16017128 ] d0evi1 commented on SPARK-20797: ok, there is: https://github.com/apache/spark/pull/18034 > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
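For reference, the change described above mirrors the Word2Vec fix linked in the description: derive the partition count from the approximate size of the topics matrix instead of forcing a single partition. A sketch of the idea (names taken from the surrounding LDAModel/Word2Vec code; the actual pull request may differ):

{code}
// Inside LocalLDAModel.save: one 8-byte double per entry of the
// vocabSize x k topics matrix, split so each partition stays under the
// kryo buffer limit, as Word2Vec.save already does.
val approxSize = 8L * vocabSize * k
val bufferSize = Utils.byteStringAsBytes(
  spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
val nPartitions = ((approxSize / bufferSize) + 1).toInt
spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
{code}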
[jira] [Commented] (SPARK-20807) Add compression/decompression of data to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017154#comment-16017154 ] Apache Spark commented on SPARK-20807: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18033 > Add compression/decompression of data to ColumnVector > - > > Key: SPARK-20807 > URL: https://issues.apache.org/jira/browse/SPARK-20807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > While current {{CachedBatch}} can compress data by using of of multiple > compression schemes, {{ColumnVector}} cannot compress data. It is mandatory > for table cache. > This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20807) Add compression/decompression of data to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20807: Assignee: (was: Apache Spark) > Add compression/decompression of data to ColumnVector > - > > Key: SPARK-20807 > URL: https://issues.apache.org/jira/browse/SPARK-20807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > While the current {{CachedBatch}} can compress data by using one of multiple > compression schemes, {{ColumnVector}} cannot compress data. This is mandatory > for the table cache. > This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20807) Add compression/decompression of data to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20807: Assignee: Apache Spark > Add compression/decompression of data to ColumnVector > - > > Key: SPARK-20807 > URL: https://issues.apache.org/jira/browse/SPARK-20807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > While the current {{CachedBatch}} can compress data by using one of multiple > compression schemes, {{ColumnVector}} cannot compress data. This is mandatory > for the table cache. > This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
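For readers unfamiliar with the column compression schemes mentioned above, here is a hypothetical, self-contained illustration of one of them (run-length encoding) as a per-column compress/decompress round trip. The actual {{ColumnVector}} design lives in the linked pull request; none of these names come from it:

{code}
// Hypothetical sketch: a run-length encode/decode pair for one column of ints,
// the kind of scheme CachedBatch already applies and ColumnVector does not yet.
def rleCompress(values: Array[Int]): Array[(Int, Int)] = {
  val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
  for (v <- values) {
    if (runs.nonEmpty && runs.last._1 == v) {
      val (value, count) = runs.remove(runs.length - 1)
      runs += ((value, count + 1))   // extend the current run
    } else {
      runs += ((v, 1))               // start a new run
    }
  }
  runs.toArray
}

def rleDecompress(runs: Array[(Int, Int)]): Array[Int] =
  runs.flatMap { case (value, count) => Array.fill(count)(value) }

// Round trip: decompress(compress(xs)) == xs
val xs = Array(7, 7, 7, 3, 3, 9)
assert(rleDecompress(rleCompress(xs)).sameElements(xs))
{code}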
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017192#comment-16017192 ] Cristian Opris commented on SPARK-16365: There's another potential argument for exposing 'local' (non-distributed) implementations of the algorithms: sometimes it's useful to apply an algorithm to relatively small groupings of data within a very large dataset. In this case Spark would only serve to distribute the data and apply the algorithm locally on each partition/grouping of data, perhaps through a UDF. This can currently be achieved with the scikit-learn integration, but it would be worth making it possible to use the Spark implementation of an algorithm where that implementation is not inherently distributed. CountVectorizer is a good example: nothing in it inherently requires a DataFrame. In practice this should mostly mean just exposing the core implementation of the algorithms where possible. > Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstormings and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
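A small sketch of the pattern described in the comment above, using Spark only to route each group to one place and then fitting a purely local "model" per group. {{fitLocal}} is a hypothetical stand-in for any local (non-distributed) algorithm:

{code}
import org.apache.spark.sql.SparkSession

case class Point(group: String, features: Array[Double])

// Hypothetical local "algorithm": any function over an in-memory group of points.
def fitLocal(points: Seq[Point]): Double =
  points.map(_.features.sum).sum / points.size

val spark = SparkSession.builder().master("local[*]").appName("local-per-group").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq(
  Point("a", Array(1.0, 2.0)), Point("a", Array(3.0, 4.0)), Point("b", Array(5.0, 6.0))))

// Spark distributes the groups; the algorithm itself runs locally per group,
// and only the small fitted results come back to the driver.
val models = ds.groupByKey(_.group).mapGroups { (g, it) => (g, fitLocal(it.toSeq)) }
models.show()
{code}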
[jira] [Created] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
Yanbo Liang created SPARK-20810: --- Summary: ML LinearSVC vs MLlib SVMWithSGD output different solution Key: SPARK-20810 URL: https://issues.apache.org/jira/browse/SPARK-20810 Project: Spark Issue Type: Question Components: ML, MLlib Affects Versions: 2.2.0 Reporter: Yanbo Liang Fitting an SVM classification model on the same dataset, ML {{LinearSVC}} produces a different solution than MLlib {{SVMWithSGD}}. I understand they use different optimization solvers (OWLQN vs SGD), but does it make sense for them to converge to different solutions? AFAIK, both of them use hinge loss, which is a convex but non-differentiable function. Since the derivative of the hinge loss is not uniquely defined at the hinge point, should we switch to squared hinge loss? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
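To make the non-differentiability point concrete, here is a standalone illustration (not Spark code) of hinge vs squared hinge loss as functions of the margin. At margin = 1 the hinge has a kink where any subgradient in [-1, 0] is admissible, while the squared hinge is differentiable everywhere:

{code}
def hinge(margin: Double): Double = math.max(0.0, 1.0 - margin)
def squaredHinge(margin: Double): Double = {
  val h = math.max(0.0, 1.0 - margin)
  h * h
}

// d/d(margin): at margin == 1 the hinge derivative is a convention
// (0 here, -1 in other implementations); squared hinge is unambiguous.
def hingeGrad(margin: Double): Double =
  if (margin < 1.0) -1.0 else 0.0
def squaredHingeGrad(margin: Double): Double =
  if (margin < 1.0) -2.0 * (1.0 - margin) else 0.0

println((hinge(0.5), squaredHinge(0.5)))  // (0.5, 0.25)
println(hingeGrad(1.0))                   // 0.0 under this convention
println(squaredHingeGrad(1.0))            // 0.0, and continuous around the kink
{code}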
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use Hinge loss which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use squared hinge loss which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use Hinge loss which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use squared hinge loss? > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? > AFAIK, both of them use Hinge loss which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use squared hinge loss which is the > default loss function of {{sklearn.svm.LinearSVC}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
> {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, features: > Vector) => > OldLabeledPoint(label, OldVectors.fromML(features)) > } > val trainer2 = new SVMWithSGD().setIntercept(true) > > trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) > val model2 = trainer2.run(oldData) > println(model2.weights) > println(model2.intercept) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use Hinge loss which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use squared hinge loss which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use Hinge loss which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use squared hinge loss which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? > AFAIK, both of them use Hinge loss which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use squared hinge loss which is the > default loss function of {{sklearn.svm.LinearSVC}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, feat
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use Hinge loss which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use squared hinge loss which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2)
[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017217#comment-16017217 ] Yanbo Liang commented on SPARK-20810: - cc [~josephkb] [~yuhaoyan] > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, features: > Vector) => > OldLabeledPoint(label, OldVectors.fromML(features)) > } > val trainer2 = new SVMWithSGD().setIntercept(true) > > trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) > val model2 = trainer2.run(oldData) > println(model2.weights) > println(model2.intercept) > } > {code} > The output is: > {code} > [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] > 7.373454363024084 > [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] > 0.667790514894194 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust then {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable >
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust then {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}} and mo
[jira] [Updated] (SPARK-20808) External Table unnecessarily not created in Hive-compatible way
[ https://issues.apache.org/jira/browse/SPARK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-20808: --- Summary: External Table unnecessarily not created in Hive-compatible way (was: External Table unnecessarily not create in Hive-compatible way) > External Table unnecessarily not created in Hive-compatible way > --- > > Key: SPARK-20808 > URL: https://issues.apache.org/jira/browse/SPARK-20808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Joachim Hereth >Priority: Minor > > In Spark 2.1.0 and 2.1.1 {{spark.catalog.createExternalTable}} creates tables > unnecessarily in a hive-incompatible way. > For instance executing in a spark shell > {code} > val database = "default" > val table = "table_name" > val path = "/user/daki/" + database + "/" + table > var data = Array(("Alice", 23), ("Laura", 33), ("Peter", 54)) > val df = sc.parallelize(data).toDF("name","age") > df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet(path) > spark.sql("DROP TABLE IF EXISTS " + database + "." + table) > spark.catalog.createExternalTable(database + "."+ table, path) > {code} > issues the warning > {code} > Search Subject for Kerberos V5 INIT cred (<>, > sun.security.jgss.krb5.Krb5InitCredential) > 17/05/19 11:01:17 WARN hive.HiveExternalCatalog: Could not persist > `default`.`table_name` in a Hive compatible way. Persisting it into Hive > metastore in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User > daki does not have privileges for CREATETABLE) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720) > ... > {code} > The Exception (user does not have privileges for CREATETABLE) is misleading > (I do have the CREATE TABLE privilege). > Querying the table with Hive does not return any result. With Spark one can > access the data. > The following code creates the table correctly (workaround): > {code} > def sqlStatement(df : org.apache.spark.sql.DataFrame, database : String, > table: String, path: String) : String = { > val rows = (for(col <- df.schema) > yield "`" + col.name + "` " + > col.dataType.simpleString).mkString(",\n") > val sqlStmnt = ("CREATE EXTERNAL TABLE `%s`.`%s` (%s) " + > "STORED AS PARQUET " + > "Location 'hdfs://nameservice1%s'").format(database, table, rows, path) > return sqlStmnt > } > spark.sql("DROP TABLE IF EXISTS " + database + "." + table) > spark.sql(sqlStatement(df, database, table, path)) > {code} > The code is executed via YARN against a Cloudera CDH 5.7.5 cluster with > Sentry enabled (in case this matters regarding the privilege warning). Spark > was built against the CDH libraries. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017258#comment-16017258 ] Steve Loughran commented on SPARK-20799: bq. Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future., but this should not mean that it shouldn't work anymore. It probably will stop working at some point in the future, as putting secrets in URIs is too dangerous: everything logs them on the assumption that they aren't sensitive data. The {{S3xLoginHelper}} not only warns you, it makes a best-effort attempt to strip the secrets out of the public URI, hence the logs and the messages telling you off. Prior to Hadoop 2.8, the sole *defensible* use case for secrets in URIs was that it was the only way to have different logins on different buckets. In Hadoop 2.8 we added the ability to configure any of the fs.s3a. options on a per-bucket basis, including the secret logins, endpoints, and other important values. I see what may be happening; in that case it probably constitutes a Hadoop regression: if the filesystem's URI is converted to a string it will have these stripped, so if something goes path -> URI -> String -> path, the secrets will be lost. If you are seeing this stack trace, it means you are using Hadoop 2.8 or something else with the HADOOP-3733 patch in it. What version of Hadoop (or HDP, CDH..) are you using? If it is based on the full Apache 2.8 release, you get # per-bucket config to allow you to [configure each bucket separately|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets] # the ability to use JCEKS files to keep the secrets out of the configs # session token support. Accordingly, if you state the version, I may be able to look at what's happening in a bit more detail. > Unable to infer schema for ORC on reading ORC from S3 > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Jork Zijlstra > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply a partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. 
> Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
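For the Hadoop 2.8 per-bucket configuration Steve mentions, an alternative to embedding credentials in the path might look like this. The bucket name and key values are placeholders, and this assumes the s3a connector (which supports per-bucket options) rather than s3n:

{code}
import org.apache.spark.sql.SparkSession

// Per-bucket credentials (Hadoop 2.8+): fs.s3a.bucket.<bucket>.<option>.
val spark = SparkSession.builder()
  .appName("orc-on-s3")
  .config("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "ACCESS_KEY")
  .config("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "SECRET_KEY")
  .getOrCreate()

// No secrets in the URI, so the stripped and qualified paths agree
// and schema inference can see the files.
val df = spark.read.orc("s3a://my-bucket/path/to/table")
{code}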
[jira] [Resolved] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20798. --- Resolution: Fixed Assignee: Ala Luszczak Fix Version/s: 2.2.0 2.1.2 > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak > Fix For: 2.1.2, 2.2.0 > > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. This can lead to errors. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
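Presumably the fix makes the codegen emit the null check before the getter, so the generated snippet from the description would instead look roughly like this (reconstructed for illustration, not copied from the patch):

{noformat}
/* 059 */ if (value.isNullAt(0)) {
/* 060 */   rowWriter1.setNullAt(0);
/* 061 */ } else {
/* 062 */   final UTF8String fieldName = value.getUTF8String(0);
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
{noformat}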
[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017273#comment-16017273 ] Sean Owen commented on SPARK-20810: --- Are you pretty sure both are converged? You set the same params but do they have the same meaning in both implementations? I wonder if you can double-check the loss that both are computing to see if they even agree about how good a solution the other has found. I doubt the discontinuity of the hinge loss matters as it only affects the gradient when the loss is exactly 0, and defining the derivative as 0 or 1 is valid and doesn't matter much, or shouldn't. > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}} and more robust than > {{hinge loss}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, features: > Vector) => > OldLabeledPoint(label, OldVectors.fromML(features)) > } > val trainer2 = new SVMWithSGD().setIntercept(true) > > trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) > val model2 = trainer2.run(oldData) > println(model2.weights) > println(model2.intercept) > } > {code} > The output is: > {code} > [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] > 7.373454363024084 > [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] > 0.667790514894194 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017278#comment-16017278 ] Jork Zijlstra commented on SPARK-20799: --- Hi Steve, Thanks for the quick response. We indeed no longer need the credentials to be on the path. I forgot to mention the version we are running: Spark 2.1.1 with Hadoop 2.8.0. Is there any other information you need? Regards, Jork > Unable to infer schema for ORC on reading ORC from S3 > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Jork Zijlstra > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply a partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.classification.SVMWithSGD import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265] 0.9656577947867953 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265] 0.9656577947867953 {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both o
[jira] [Updated] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20810: Description: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. {code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265] 0.9656577947867953 {code} was: Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} produces different solution compared with MLlib {{SVMWithSGD}}. I understand they use different optimization solver (OWLQN vs SGD), does it make sense to converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like {{SVMWithSGD}} produce wrong solution. Does it also like this? AFAIK, both of them use {{hinge loss}} which is convex but not differentiable function. Since the derivative of the hinge loss at certain place is non-deterministic, should we switch to use {{squared hinge loss}} which is the default loss function of {{sklearn.svm.LinearSVC}} and more robust than {{hinge loss}}? This issue is very easy to reproduce, you can paste the following code snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. 
{code} test("LinearSVC vs SVMWithSGD") { import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} val trainer1 = new LinearSVC() .setRegParam(0.2) .setMaxIter(200) .setTol(1e-4) val model1 = trainer1.fit(binaryDataset) println(model1.coefficients) println(model1.intercept) val oldData = binaryDataset.rdd.map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val trainer2 = new SVMWithSGD().setIntercept(true) trainer2.optimizer.setRegParam(0.2).setNumIterations(200).setConvergenceTol(1e-4) val model2 = trainer2.run(oldData) println(model2.weights) println(model2.intercept) } {code} The output is: {code} [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] 7.373454363024084 [0.58166680313823,1.1938960150473041,1.7940106824589588,2.4884300611292165] 0.667790514894194 {code} > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable
[jira] [Commented] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017300#comment-16017300 ] Yanbo Liang commented on SPARK-20810: - [~srowen] Thanks for your comments. I'm sure both have converged. ML LinearSVC converged after 143 epochs, and MLlib SVMWithSGD converged after 1794 epochs. It seems we should spend some effort investigating the correctness of the old MLlib implementation. > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}} and more robust than > {{hinge loss}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.classification.SVMWithSGD > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, features: > Vector) => > OldLabeledPoint(label, OldVectors.fromML(features)) > } > val trainer2 = new SVMWithSGD().setIntercept(true) > > trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4) > val model2 = trainer2.run(oldData) > println(model2.weights) > println(model2.intercept) > } > {code} > The output is: > {code} > [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] > 7.373454363024084 > [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265] > 0.9656577947867953 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20810) ML LinearSVC vs MLlib SVMWithSGD output different solution
[ https://issues.apache.org/jira/browse/SPARK-20810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017300#comment-16017300 ] Yanbo Liang edited comment on SPARK-20810 at 5/19/17 12:02 PM: --- [~srowen] Thanks for your comments. I'm sure both are converged. ML LinearSVC converged after 143 epoch, and MLlib SVMWithSGD converged after 1794 epoch. It seems that we should pay some efforts to investigate the correctness of old MLlib implementation. Or there are some implementation difference in detail, I'll try to make a closer inspection. was (Author: yanboliang): [~srowen] Thanks for your comments. I'm sure both are converged. ML LinearSVC converged after 143 epoch, and MLlib SVMWithSGD converged after 1794 epoch. It seems that we should pay some efforts to investigate the correctness of old MLlib implementation. > ML LinearSVC vs MLlib SVMWithSGD output different solution > -- > > Key: SPARK-20810 > URL: https://issues.apache.org/jira/browse/SPARK-20810 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > Fitting with SVM classification model on the same dataset, ML {{LinearSVC}} > produces different solution compared with MLlib {{SVMWithSGD}}. I understand > they use different optimization solver (OWLQN vs SGD), does it make sense to > converge to different solution? Since we use {{sklearn.svm.LinearSVC}} and R > e1071 SVM as the reference in {{LinearSVCSuite}}, it seems like > {{SVMWithSGD}} produce wrong solution. Does it also like this? > AFAIK, both of them use {{hinge loss}} which is convex but not differentiable > function. Since the derivative of the hinge loss at certain place is > non-deterministic, should we switch to use {{squared hinge loss}} which is > the default loss function of {{sklearn.svm.LinearSVC}} and more robust than > {{hinge loss}}? > This issue is very easy to reproduce, you can paste the following code > snippet to {{LinearSVCSuite}} and then click run in Intellij IDE. > {code} > test("LinearSVC vs SVMWithSGD") { > import org.apache.spark.mllib.linalg.{Vectors => OldVectors} > import org.apache.spark.mllib.classification.SVMWithSGD > import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint} > val trainer1 = new LinearSVC() > .setRegParam(0.2) > .setMaxIter(200) > .setTol(1e-4) > val model1 = trainer1.fit(binaryDataset) > println(model1.coefficients) > println(model1.intercept) > val oldData = binaryDataset.rdd.map { case Row(label: Double, features: > Vector) => > OldLabeledPoint(label, OldVectors.fromML(features)) > } > val trainer2 = new SVMWithSGD().setIntercept(true) > > trainer2.optimizer.setRegParam(0.2).setNumIterations(2000).setConvergenceTol(1e-4) > val model2 = trainer2.run(oldData) > println(model2.weights) > println(model2.intercept) > } > {code} > The output is: > {code} > [7.24661385022775,14.774484832179743,22.00945617480461,29.558498069476084] > 7.373454363024084 > [0.9257083966837497,1.8567843250728242,2.7381537413979595,3.7434319370941265] > 0.9656577947867953 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20773. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Fix For: 2.1.2, 2.2.0 > > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to access each > field writer by index. Since the fieldWriters object is a List, each apply(i) call is O(i), > so writing a row is quadratic in the number of fields. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
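The access pattern behind this fix is worth spelling out. A minimal sketch, assuming one writer function per field; the names and signatures below are illustrative, not the actual ParquetWriteSupport internals:

{code}
// Hypothetical stand-in for Spark's per-field writer functions.
type FieldWriter = Int => Unit

// Quadratic: List.apply(i) traverses i cons cells, so the loop does
// 0 + 1 + ... + (n-1) link hops in total.
def writeFieldsQuadratic(fieldWriters: List[FieldWriter], n: Int): Unit = {
  var i = 0
  while (i < n) {
    fieldWriters(i)(i)
    i += 1
  }
}

// Linear: one O(n) conversion to an array up front, then O(1) indexing.
def writeFieldsLinear(fieldWriters: List[FieldWriter], n: Int): Unit = {
  val writers = fieldWriters.toArray
  var i = 0
  while (i < n) {
    writers(i)(i)
    i += 1
  }
}
{code}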
[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-20773: - Assignee: T Poterba > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Assignee: T Poterba >Priority: Minor > Labels: easyfix, performance > Fix For: 2.1.2, 2.2.0 > > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to access each > field writer by index. Since the fieldWriters object is a List, each apply(i) call is O(i), > so writing a row is quadratic in the number of fields. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly
[ https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko updated SPARK-17922: --- Affects Version/s: 2.0.1 > ClassCastException java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator > cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection > - > > Key: SPARK-17922 > URL: https://issues.apache.org/jira/browse/SPARK-17922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: kanika dhuria > Attachments: spark_17922.tar.gz > > > I am using Spark 2.0 and seeing a class loading issue, because whole-stage code generation > generates multiple classes with the same name, > "org.apache.spark.sql.catalyst.expressions.GeneratedClass". > I am using DataFrame transform, and within the transform I use OSGi. > OSGi replaces the thread context class loader with ContextFinder, which looks at > all the class loaders in the stack to find the newly generated class, and finds > the byte class loader of GeneratedClass with inner class GeneratedIterator > (instead of falling back to the byte class loader created by the Janino > compiler). Since the class name is the same, that byte class loader loads the > class and returns GeneratedClass$GeneratedIterator instead of the expected > GeneratedClass$UnsafeProjection. > Can we generate different classes with different names, or is it expected to > generate only one class? > This is roughly what I am trying to do: > {noformat} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import com.databricks.spark.avro._ > def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { > //Initialize osgi > (rows:Iterator[Row]) => { > var outi = Iterator[Row]() > while(rows.hasNext) { > val r = rows.next > outi = outi.++(Iterator(Row(r.get(0)))) > } > //val ors = Row("abc") > //outi =outi.++( Iterator(ors)) > outi > } > } > def transform1( outType:StructType) :((DataFrame) => DataFrame) = { > (d:DataFrame) => { > val inType = d.schema > val rdd = d.rdd.mapPartitions(exePart(outType)) > d.sqlContext.createDataFrame(rdd, outType) > } > > } > val df = spark.read.avro("file:///data/builds/a1.avro") > val df1 = df.select($"id2").filter(false) > val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, > true)::Nil))).createOrReplaceTempView("tbl0") > spark.sql("insert overwrite table testtable select p1 from tbl0") > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20607) Add new unit tests to ShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20607: - Assignee: caoxuewen Priority: Trivial (was: Minor) > Add new unit tests to ShuffleSuite > -- > > Key: SPARK-20607 > URL: https://issues.apache.org/jira/browse/SPARK-20607 > Project: Spark > Issue Type: Test > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Trivial > Fix For: 2.3.0 > > > 1. Adds new unit tests: verify that when there is no shuffle stage, > shuffle does not generate the data file and the index files. > 2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its > local shuffle file' unit test: the parallelism is 1, not 2; check for the index file and delete it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20607) Add new unit tests to ShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20607. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17868 [https://github.com/apache/spark/pull/17868] > Add new unit tests to ShuffleSuite > -- > > Key: SPARK-20607 > URL: https://issues.apache.org/jira/browse/SPARK-20607 > Project: Spark > Issue Type: Test > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen >Priority: Minor > Fix For: 2.3.0 > > > 1. Adds new unit tests: verify that when there is no shuffle stage, > shuffle does not generate the data file and the index files. > 2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its > local shuffle file' unit test: the parallelism is 1, not 2; check for the index file and delete it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20759) SCALA_VERSION in _config.yml, LICENSE and Dockerfile should be consistent with pom.xml
[ https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20759. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 17992 [https://github.com/apache/spark/pull/17992] > SCALA_VERSION in _config.yml, LICENSE and Dockerfile should be consistent with > pom.xml > - > > Key: SPARK-20759 > URL: https://issues.apache.org/jira/browse/SPARK-20759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Minor > Fix For: 2.2.0, 2.1.2 > > > SCALA_VERSION in _config.yml, LICENSE and Dockerfile is 2.11.7, but it is 2.11.8 in > pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with > pom.xml. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20759) SCALA_VERSION in _config.yml, LICENSE and Dockerfile should be consistent with pom.xml
[ https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20759: - Assignee: liuzhaokun Priority: Trivial (was: Minor) > SCALA_VERSION in _config.yml, LICENSE and Dockerfile should be consistent with > pom.xml > - > > Key: SPARK-20759 > URL: https://issues.apache.org/jira/browse/SPARK-20759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: liuzhaokun >Priority: Trivial > Fix For: 2.1.2, 2.2.0 > > > SCALA_VERSION in _config.yml, LICENSE and Dockerfile is 2.11.7, but it is 2.11.8 in > pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with > pom.xml. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017528#comment-16017528 ] Mathieu D commented on SPARK-18838: --- I'm not very familiar with this part of Spark, but I'd like to share a thought. In my experience (SPARK-18881), when events start to be dropped because of full event queues, it's much more serious than just a failed job: the Spark driver became useless and I had to relaunch it. So, besides the improvement of the existing bus, listeners and threads, wouldn't a kind of back-pressure mechanism (on task emission) be better than dropping events? I mean, this would obviously degrade job performance, but it's still better than compromising the whole job or even the driver's health. my2cent > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: SparkListernerComputeTime.xlsx > > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017560#comment-16017560 ] Antoine PRANG commented on SPARK-18838: --- [~mathieude]]: Yep, I introduced a blocking strategy for the LiveListenerBus (if the queue is full, we wait for space instead of dropping events). This is not the default strategy, but it can be activated through a setting. The default strategy remains the dropping one. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: SparkListernerComputeTime.xlsx > > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
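The dropping-versus-blocking trade-off discussed in the two comments above comes down to which failure mode a bounded queue exposes. A minimal sketch of the two strategies with a bounded {{java.util.concurrent.LinkedBlockingQueue}} (illustrative only, not the actual LiveListenerBus code):

{code}
import java.util.concurrent.LinkedBlockingQueue

object EventQueueSketch {
  // Bounded queue between event producers and the listener thread.
  val queue = new LinkedBlockingQueue[String](10000)

  // Dropping strategy (the current default): offer() returns false
  // when the queue is full, and the event is lost.
  def postDropping(event: String): Boolean = queue.offer(event)

  // Blocking strategy (the opt-in described above): put() waits for
  // free space, applying back-pressure to the poster instead of
  // losing the event.
  def postBlocking(event: String): Unit = queue.put(event)
}
{code}

Blocking trades event loss for slower event posting while the queue drains, which is the degradation-versus-correctness argument made above.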
[jira] [Comment Edited] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017560#comment-16017560 ] Antoine PRANG edited comment on SPARK-18838 at 5/19/17 3:39 PM: [~mathieude]: Yep, I introduced a blocking strategy for the LiveListenerBus (if the queue is full, we wait for space instead of dropping events). This is not the default strategy, but it can be activated through a setting. The default strategy remains the dropping one. was (Author: boomx): [~mathieude]]: Yep, I introduced a blocking strategy for the LiveListenerBus (if the queue is full, we wait for space instead of dropping events). This is not the default strategy, but it can be activated through a setting. The default strategy remains the dropping one. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Attachments: SparkListernerComputeTime.xlsx > > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nan Zhu updated SPARK-20811: Summary: GBT Classifier failed with mysterious StackOverflowError (was: GBT Classifier failed with mysterious StackOverflowException ) > GBT Classifier failed with mysterious StackOverflowError > > > Key: SPARK-20811 > URL: https://issues.apache.org/jira/browse/SPARK-20811 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nan Zhu > > I am running the GBT Classifier over the airline dataset (combining 2005-2008); in > total it's around 22M training examples. > The code is simple: > {code:title=Bar.scala|borderStyle=solid} > val gradientBoostedTrees = new GBTClassifier() > gradientBoostedTrees.setMaxBins(1000) > gradientBoostedTrees.setMaxIter(500) > gradientBoostedTrees.setMaxDepth(6) > gradientBoostedTrees.setStepSize(1.0) > transformedTrainingSet.cache().foreach(_ => Unit) > val startTime = System.nanoTime() > val model = gradientBoostedTrees.fit(transformedTrainingSet) > println(s"===training time cost: ${(System.nanoTime() - startTime) / > 1000.0 / 1000.0} ms") > val resultDF = model.transform(transformedTestset) > val binaryClassificationEvaluator = new BinaryClassificationEvaluator() > > binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label") > println(s"=test AUC: > ${binaryClassificationEvaluator.evaluate(resultDF)}==") > {code} > My training job always fails with > {quote} > 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID > 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError > at > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037) > at > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479) > at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > {quote} > The above pattern repeats many times. > Is it a bug, or did I do something wrong when using GBTClassifier in ML? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowException
Nan Zhu created SPARK-20811: --- Summary: GBT Classifier failed with mysterious StackOverflowException Key: SPARK-20811 URL: https://issues.apache.org/jira/browse/SPARK-20811 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Nan Zhu I am running the GBT Classifier over the airline dataset (combining 2005-2008); in total it's around 22M training examples. The code is simple: {code:title=Bar.scala|borderStyle=solid} val gradientBoostedTrees = new GBTClassifier() gradientBoostedTrees.setMaxBins(1000) gradientBoostedTrees.setMaxIter(500) gradientBoostedTrees.setMaxDepth(6) gradientBoostedTrees.setStepSize(1.0) transformedTrainingSet.cache().foreach(_ => Unit) val startTime = System.nanoTime() val model = gradientBoostedTrees.fit(transformedTrainingSet) println(s"===training time cost: ${(System.nanoTime() - startTime) / 1000.0 / 1000.0} ms") val resultDF = model.transform(transformedTestset) val binaryClassificationEvaluator = new BinaryClassificationEvaluator() binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label") println(s"=test AUC: ${binaryClassificationEvaluator.evaluate(resultDF)}==") {code} My training job always fails with {quote} 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037) at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479) at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) {quote} The above pattern repeats many times. Is it a bug, or did I do something wrong when using GBTClassifier in ML? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20751) Built-in SQL Function Support - COT
[ https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20751: --- Assignee: Yuming Wang > Built-in SQL Function Support - COT > --- > > Key: SPARK-20751 > URL: https://issues.apache.org/jira/browse/SPARK-20751 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Yuming Wang > Fix For: 2.3.0 > > > {noformat} > COT(<expression>) > {noformat} > Returns the cotangent of <expression>. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20751) Built-in SQL Function Support - COT
[ https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20751. - Resolution: Fixed Fix Version/s: 2.3.0 > Built-in SQL Function Support - COT > --- > > Key: SPARK-20751 > URL: https://issues.apache.org/jira/browse/SPARK-20751 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Yuming Wang > Fix For: 2.3.0 > > > {noformat} > COT(<expression>) > {noformat} > Returns the cotangent of <expression>. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
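Assuming COT follows the other trigonometric builtins, usage against a build that includes the fix (2.3.0 per the ticket) would look like this sketch, with {{spark}} an existing SparkSession:

{code}
// cot(x) = cos(x) / sin(x); COT(1) is approximately 0.6420926159343306
spark.sql("SELECT COT(1)").show()
{code}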
[jira] [Commented] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017660#comment-16017660 ] Sean Owen commented on SPARK-20811: --- I assume it's serialization of a very deep tree via the Java mechanism. Does Kryo work differently? Does increasing the stack size with something like -Xss1m at least work around it? > GBT Classifier failed with mysterious StackOverflowError > > > Key: SPARK-20811 > URL: https://issues.apache.org/jira/browse/SPARK-20811 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nan Zhu > > I am running the GBT Classifier over the airline dataset (combining 2005-2008); in > total it's around 22M training examples. > The code is simple: > {code:title=Bar.scala|borderStyle=solid} > val gradientBoostedTrees = new GBTClassifier() > gradientBoostedTrees.setMaxBins(1000) > gradientBoostedTrees.setMaxIter(500) > gradientBoostedTrees.setMaxDepth(6) > gradientBoostedTrees.setStepSize(1.0) > transformedTrainingSet.cache().foreach(_ => Unit) > val startTime = System.nanoTime() > val model = gradientBoostedTrees.fit(transformedTrainingSet) > println(s"===training time cost: ${(System.nanoTime() - startTime) / > 1000.0 / 1000.0} ms") > val resultDF = model.transform(transformedTestset) > val binaryClassificationEvaluator = new BinaryClassificationEvaluator() > > binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label") > println(s"=test AUC: > ${binaryClassificationEvaluator.evaluate(resultDF)}==") > {code} > My training job always fails with > {quote} > 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID > 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError > at > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037) > at > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479) > at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > {quote} > The above pattern repeats many times. > Is it a bug, or did I do something wrong when using GBTClassifier in ML? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
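One way to try the stack-size workaround suggested above is through the extra JVM options. A sketch: the 4m value is an arbitrary guess, and whether the driver, the executors, or both need it depends on where the deserialization overflows:

{code}
import org.apache.spark.SparkConf

// Workaround, not a fix: enlarge thread stacks so deeply nested Java
// deserialization of a large tree ensemble can complete.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xss4m")
  .set("spark.driver.extraJavaOptions", "-Xss4m") // only effective if set before the driver JVM starts
{code}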
[jira] [Reopened] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-12139: - > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive interprets the backquoted identifier as a regular expression over the column names; this could be supported in the Hive parser for Spark as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
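For reference, the Hive behaviour being requested looks like the following sketch. The SET flag name is an assumption about how a Spark implementation might be toggled; it is not stated in this ticket (Hive itself uses hive.support.quoted.identifiers=none):

{code}
// Hypothetical Spark flag name, by analogy with Hive's setting.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
// Backquoted regex over column names: in Hive, `(a)?+.+` is the idiom
// for "every column of A except a".
spark.sql("SELECT `(a)?+.+` FROM A").show()
{code}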
[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
[ https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017699#comment-16017699 ] Felix Cheung commented on SPARK-18825: -- interesting - do you think knitr can take your change for -method? I'm actually not sure about the part with dontrun - could you explain a bit? > Eliminate duplicate links in SparkR API doc index > - > > Key: SPARK-18825 > URL: https://issues.apache.org/jira/browse/SPARK-18825 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > The SparkR API docs contain many duplicate links with suffixes {{-method}} or > {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same > doc. > Copying from [~felixcheung] in [SPARK-18332]: > {quote} > They are because of the > {{@ aliases}} > tags. I think we are adding them because CRAN checks require them to match > the specific format - [~shivaram] would you know? > I am pretty sure they are double-listed because in addition to aliases we > also have > {{@ rdname}} > which automatically generate the links as well. > I suspect if we change all the rdname to match the string in aliases then > there will be one link. I can take a shot at this to test this out, but > changes will be very extensive - is this something we could get into 2.1 > still? > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20763) The function of `month` and `day` return a value which is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20763: Fix Version/s: (was: 2.3.0) > The function of `month` and `day` return a value which is not what we expected > -- > > Key: SPARK-20763 > URL: https://issues.apache.org/jira/browse/SPARK-20763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.2.0 > > > spark-sql>select month("1582-09-28"); > spark-sql>10 > For this case, the expected result is 9, but it is 10. > spark-sql>select day("1582-04-18"); > spark-sql>28 > For this case, the expected result is 18, but it is 28. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20763) The function of `month` and `day` return a value which is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20763. - Resolution: Fixed Assignee: liuxian Fix Version/s: 2.3.0 2.2.0 > The function of `month` and `day` return a value which is not what we expected > -- > > Key: SPARK-20763 > URL: https://issues.apache.org/jira/browse/SPARK-20763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > spark-sql>select month("1582-09-28"); > spark-sql>10 > For this case, the expected result is 9, but it is 10. > spark-sql>select day("1582-04-18"); > spark-sql>28 > For this case, the expected result is 18, but it is 28. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
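As a sanity check of the expected values, the proleptic Gregorian calendar used by {{java.time}} agrees with the report (my own verification, not from the ticket):

{code}
import java.time.LocalDate

// java.time interprets dates on the proleptic Gregorian calendar.
LocalDate.parse("1582-09-28").getMonthValue  // 9, as expected
LocalDate.parse("1582-04-18").getDayOfMonth  // 18, as expected
{code}

Both reported dates fall before the October 1582 Gregorian cutover, which is presumably where the conversion logic diverges.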
[jira] [Created] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher
Michael Gummelt created SPARK-20812: --- Summary: Add Mesos Secrets support to the spark dispatcher Key: SPARK-20812 URL: https://issues.apache.org/jira/browse/SPARK-20812 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.3.0 Reporter: Michael Gummelt Mesos 1.3 supports secrets. In order to support sending keytabs through the Spark Dispatcher, or any other secret, we need to integrate this with the Spark Dispatcher. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12139: Assignee: Apache Spark > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Assignee: Apache Spark >Priority: Minor > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive interprets the backquoted identifier as a regular expression over the column names; this could be supported in the Hive parser for Spark as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12139: Assignee: (was: Apache Spark) > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive interprets the backquoted identifier as a regular expression over the column names; this could be supported in the Hive parser for Spark as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher
[ https://issues.apache.org/jira/browse/SPARK-20812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-20812: Description: Mesos 1.3 supports secrets. In order to support sending keytabs through the Spark Dispatcher, or any other secret, we need to integrate this with the Spark Dispatcher. The integration should include support for both file-based and env-based secrets. was:Mesos 1.3 supports secrets. In order to support sending keytabs through the Spark Dispatcher, or any other secret, we need to integrate this with the Spark Dispatcher. > Add Mesos Secrets support to the spark dispatcher > - > > Key: SPARK-20812 > URL: https://issues.apache.org/jira/browse/SPARK-20812 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Michael Gummelt > > Mesos 1.3 supports secrets. In order to support sending keytabs through the > Spark Dispatcher, or any other secret, we need to integrate this with the > Spark Dispatcher. > The integration should include support for both file-based and env-based > secrets. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Description: Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs and [SPARK-18864]. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") * Update R vignettes Note: This task is for large changes to the guides. New features are handled in [SPARK-18330]. was: Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs and [SPARK-17692]. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") * Update R vignettes Note: This task is for large changes to the guides. New features are handled in [SPARK-18330]. > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-18864]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Description: Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs and [SPARK-18864]. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") * Update R vignettes Note: This task is for large changes to the guides. New features are handled in [SPARK-20505]. was: Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs and [SPARK-18864]. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") * Update R vignettes Note: This task is for large changes to the guides. New features are handled in [SPARK-18330]. > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-18864]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-20505]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20813) executor page search by status not working
Jong Yoon Lee created SPARK-20813: - Summary: executor page search by status not working Key: SPARK-20813 URL: https://issues.apache.org/jira/browse/SPARK-20813 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Jong Yoon Lee Priority: Trivial When searching for status keywords such as active, dead or Blacklisted, nothing is returned in the table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20813) Web UI executor page tab search by status not working
[ https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jong Yoon Lee updated SPARK-20813: -- Summary: Web UI executor page tab search by status not working (was: executor page search by status not working ) > Web UI executor page tab search by status not working > -- > > Key: SPARK-20813 > URL: https://issues.apache.org/jira/browse/SPARK-20813 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jong Yoon Lee >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > When searching for status keywords such as active, dead or Blacklisted, > nothing is returned in the table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
[ https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017817#comment-16017817 ] Maciej Szymkiewicz commented on SPARK-18825: Originally I thought about patching it for our own usage, but I can open an issue / PR and see what they have to say. Problematic html is not even generated by {{knitr}} so technically speaking we can just {{sed}} this thing between: {code} . "$FWDIR/install-dev.sh" {code} and calling {{knitr}} Regarding {{dontrun}} - right now we have a lot of examples which are never executed to satisfy CRAN requirements. Calling these could: - Serve as additional tests. - Reduce maintenance burden. - Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide actual output). > Eliminate duplicate links in SparkR API doc index > - > > Key: SPARK-18825 > URL: https://issues.apache.org/jira/browse/SPARK-18825 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > The SparkR API docs contain many duplicate links with suffixes {{-method}} or > {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same > doc. > Copying from [~felixcheung] in [SPARK-18332]: > {quote} > They are because of the > {{@ aliases}} > tags. I think we are adding them because CRAN checks require them to match > the specific format - [~shivaram] would you know? > I am pretty sure they are double-listed because in addition to aliases we > also have > {{@ rdname}} > which automatically generate the links as well. > I suspect if we change all the rdname to match the string in aliases then > there will be one link. I can take a shot at this to test this out, but > changes will be very extensive - is this something we could get into 2.1 > still? > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
[ https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017817#comment-16017817 ] Maciej Szymkiewicz edited comment on SPARK-18825 at 5/19/17 6:42 PM: - Originally I thought about patching it for our own usage, but I can open an issue / PR and see what they have to say. Problematic html is not even generated by {{knitr}} so technically speaking we can just {{sed}} this thing between: {code} . "$FWDIR/install-dev.sh" {code} and calling {{knitr}} Regarding {{dontrun}} - right now we have a lot of examples which are never executed to satisfy CRAN requirements but could be run locally when we {{create_docs}}. Running these could: - Serve as additional tests. - Reduce maintenance burden. - Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide actual output). was (Author: zero323): Originally I thought about patching it for our own usage, but I can open an issue / PR and see what they have to say. Problematic html is not even generated by {{knitr}} so technically speaking we can just {{sed}} this thing between: {code} . "$FWDIR/install-dev.sh" {code} and calling {{knitr}} Regarding {{dontrun}} - right now we have a lot of examples which are never executed to satisfy CRAN requirements. Calling these could: - Serve as additional tests. - Reduce maintenance burden. - Improve quality of the docs (strip {{## Not run:}} and {{##D}} and provide actual output). > Eliminate duplicate links in SparkR API doc index > - > > Key: SPARK-18825 > URL: https://issues.apache.org/jira/browse/SPARK-18825 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > The SparkR API docs contain many duplicate links with suffixes {{-method}} or > {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same > doc. > Copying from [~felixcheung] in [SPARK-18332]: > {quote} > They are because of the > {{@ aliases}} > tags. I think we are adding them because CRAN checks require them to match > the specific format - [~shivaram] would you know? > I am pretty sure they are double-listed because in addition to aliases we > also have > {{@ rdname}} > which automatically generate the links as well. > I suspect if we change all the rdname to match the string in aliases then > there will be one link. I can take a shot at this to test this out, but > changes will be very extensive - is this something we could get into 2.1 > still? > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20506. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17996 [https://github.com/apache/spark/pull/17996] > ML, Graph 2.2 QA: Programming guide update and migration guide > -- > > Key: SPARK-20506 > URL: https://issues.apache.org/jira/browse/SPARK-20506 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Critical > Fix For: 2.2.0 > > > Before the release, we need to update the MLlib and GraphX Programming > Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20813) Web UI executor page tab search by status not working
[ https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20813: Assignee: Apache Spark > Web UI executor page tab search by status not working > -- > > Key: SPARK-20813 > URL: https://issues.apache.org/jira/browse/SPARK-20813 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jong Yoon Lee >Assignee: Apache Spark >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > When searching for status keywords such as active, dead or Blacklisted, > nothing is returned in the table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20813) Web UI executor page tab search by status not working
[ https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017907#comment-16017907 ] Apache Spark commented on SPARK-20813: -- User 'yoonlee95' has created a pull request for this issue: https://github.com/apache/spark/pull/18036 > Web UI executor page tab search by status not working > -- > > Key: SPARK-20813 > URL: https://issues.apache.org/jira/browse/SPARK-20813 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jong Yoon Lee >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > When searching for status keywords such as active, dead or Blacklisted, > nothing is returned in the table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20813) Web UI executor page tab search by status not working
[ https://issues.apache.org/jira/browse/SPARK-20813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20813: Assignee: (was: Apache Spark) > Web UI executor page tab search by status not working > -- > > Key: SPARK-20813 > URL: https://issues.apache.org/jira/browse/SPARK-20813 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Jong Yoon Lee >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > When searching for status keywords such as active, dead or Blacklisted, > nothing is returned in the table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017912#comment-16017912 ] Paul Praet commented on SPARK-4820: --- I confirm - still an issue when trying to build Spark 2.1.1 on Ubuntu 16.04. > Spark build encounters "File name too long" on some encrypted filesystems > - > > Key: SPARK-4820 > URL: https://issues.apache.org/jira/browse/SPARK-4820 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Patrick Wendell >Assignee: Theodore Vasiloudis >Priority: Minor > Fix For: 1.4.0 > > > This was reported by Luchesar Cekov on github along with a proposed fix. The > fix has some potential downstream issues (it will modify the classnames) so > until we understand better how many users are affected we aren't going to > merge it. However, I'd like to include the issue and workaround here. If you > encounter this issue please comment on the JIRA so we can assess the > frequency. > The issue produces this error: > {code} > [error] == Expanded type of tree == > [error] > [error] ConstantType(value = Constant(Throwable)) > [error] > [error] uncaught exception during compilation: java.io.IOException > [error] File name too long > [error] two errors found > {code} > The workaround is in maven under the compile options add: > {code} > + -Xmax-classfile-name > + 128 > {code} > In SBT add: > {code} > +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20781. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 18013 [https://github.com/apache/spark/pull/18013] > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun > Fix For: 2.2.0, 2.1.2 > > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20781: - Assignee: liuzhaokun Priority: Minor (was: Major) Issue Type: Bug (was: Improvement) > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: liuzhaokun >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
Gene Pang created SPARK-20814: - Summary: Mesos scheduler does not respect spark.executor.extraClassPath configuration Key: SPARK-20814 URL: https://issues.apache.org/jira/browse/SPARK-20814 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.2.0 Reporter: Gene Pang When Spark executors are deployed on Mesos, the Mesos scheduler no longer respects the "spark.executor.extraClassPath" configuration parameter. MesosCoarseGrainedSchedulerBackend used to use the environment variable "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was removed in this commit [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. This effectively broke the ability for users to specify "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
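For anyone reproducing this, the configuration that stopped taking effect on Mesos is the standard one below; the master URL and jar path are placeholders:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://host:2181/mesos") // placeholder master URL
  // Entries prepended to the executor classpath; on Mesos these used to be
  // propagated via the now-removed SPARK_CLASSPATH environment variable.
  .set("spark.executor.extraClassPath", "/opt/libs/custom.jar") // placeholder path
{code}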
[jira] [Updated] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
[ https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20814: --- Target Version/s: 2.2.0 Priority: Critical (was: Major) > Mesos scheduler does not respect spark.executor.extraClassPath configuration > > > Key: SPARK-20814 > URL: https://issues.apache.org/jira/browse/SPARK-20814 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Gene Pang >Priority: Critical > > When Spark executors are deployed on Mesos, the Mesos scheduler no longer > respects the "spark.executor.extraClassPath" configuration parameter. > MesosCoarseGrainedSchedulerBackend used to use the environment variable > "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the > executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was > removed in this commit > [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. > This effectively broke the ability for users to specify > "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
[ https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017979#comment-16017979 ] Marcelo Vanzin commented on SPARK-20814: Hmm, this sucks, we should fix it for 2.2 (FYI [~marmbrus]). Let me take a stab at fixing just the Mesos usage without re-introducing that variable. > Mesos scheduler does not respect spark.executor.extraClassPath configuration > > > Key: SPARK-20814 > URL: https://issues.apache.org/jira/browse/SPARK-20814 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Gene Pang > > When Spark executors are deployed on Mesos, the Mesos scheduler no longer > respects the "spark.executor.extraClassPath" configuration parameter. > MesosCoarseGrainedSchedulerBackend used to use the environment variable > "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the > executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was > removed in this commit > [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. > This effectively broke the ability for users to specify > "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
[ https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017993#comment-16017993 ] Apache Spark commented on SPARK-20814: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18037 > Mesos scheduler does not respect spark.executor.extraClassPath configuration > > > Key: SPARK-20814 > URL: https://issues.apache.org/jira/browse/SPARK-20814 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Gene Pang >Priority: Critical > > When Spark executors are deployed on Mesos, the Mesos scheduler no longer > respects the "spark.executor.extraClassPath" configuration parameter. > MesosCoarseGrainedSchedulerBackend used to use the environment variable > "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the > executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was > removed in this commit > [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. > This effectively broke the ability for users to specify > "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
[ https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20814: Assignee: Apache Spark > Mesos scheduler does not respect spark.executor.extraClassPath configuration > > > Key: SPARK-20814 > URL: https://issues.apache.org/jira/browse/SPARK-20814 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Gene Pang >Assignee: Apache Spark >Priority: Critical > > When Spark executors are deployed on Mesos, the Mesos scheduler no longer > respects the "spark.executor.extraClassPath" configuration parameter. > MesosCoarseGrainedSchedulerBackend used to use the environment variable > "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the > executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was > removed in this commit > [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. > This effectively broke the ability for users to specify > "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20814) Mesos scheduler does not respect spark.executor.extraClassPath configuration
[ https://issues.apache.org/jira/browse/SPARK-20814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20814: Assignee: (was: Apache Spark) > Mesos scheduler does not respect spark.executor.extraClassPath configuration > > > Key: SPARK-20814 > URL: https://issues.apache.org/jira/browse/SPARK-20814 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Gene Pang >Priority: Critical > > When Spark executors are deployed on Mesos, the Mesos scheduler no longer > respects the "spark.executor.extraClassPath" configuration parameter. > MesosCoarseGrainedSchedulerBackend used to use the environment variable > "SPARK_CLASSPATH" to add the value of "spark.executor.extraClassPath" to the > executor classpath. However, "SPARK_CLASSPATH" was deprecated, and was > removed in this commit > [https://github.com/apache/spark/commit/8f0490e22b4c7f1fdf381c70c5894d46b7f7e6fb#diff-387c5d0c916278495fc28420571adf9eL178]. > This effectively broke the ability for users to specify > "spark.executor.extraClassPath" for Spark executors deployed on Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20683) Make table uncache chaining optional
[ https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018008#comment-16018008 ] Andrew Ash commented on SPARK-20683: Thanks for that diff [~shea.parkes] -- we're planning on trying it in our fork too: https://github.com/palantir/spark/pull/188 > Make table uncache chaining optional > > > Key: SPARK-20683 > URL: https://issues.apache.org/jira/browse/SPARK-20683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Not particularly environment sensitive. > Encountered/tested on Linux and Windows. >Reporter: Shea Parkes > > A recent change was made in SPARK-19765 that causes table uncaching to chain. That is, if table B is a child of table A, and they are both cached, uncaching table A will now automatically uncache table B. > At first I did not understand the need for this, but when reading the unit tests, I see that it is likely that many people do not keep named references to the child table (e.g. B). Perhaps B is just made and cached as some part of data exploration. In that situation, it makes sense for B to automatically be uncached when you are finished with A. > However, we commonly utilize a different design pattern that is now harmed by this automatic uncaching. It is common for us to cache table A and then make two independent child tables (e.g. B and C). Once those two child tables are realized and cached, we'd then uncache table A (as it was no longer needed and could be quite large). Now, after this change, when we uncache table A, we suddenly lose our cached status on both tables B and C (which is quite frustrating). All of these tables are often quite large, and we view what we're doing as mindful memory management. We are maintaining named references to B and C at all times, so we can always uncache them ourselves when it makes sense. > Would it be acceptable/feasible to make this table uncache chaining optional? > I would be fine if the default is for the chaining to happen, as long as we can turn it off via parameters. > If acceptable, I can try to work towards making the required changes. I am most comfortable in Python (and would want the optional parameter surfaced in Python), but I have found the places required to make this change in Scala (since I already reverted the functionality in a private fork). Any help would be greatly appreciated, however. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
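To make the reported pattern concrete, a minimal sketch, assuming a SparkSession named spark and illustrative paths and filters:

{code}
import spark.implicits._

val a = spark.read.parquet("/data/raw").cache() // large parent table
val b = a.filter($"kind" === "b").cache()       // first child
val c = a.filter($"kind" === "c").cache()       // second child
b.count(); c.count()                            // materialize both children while A is still cached

a.unpersist() // before SPARK-19765: only A is dropped;
              // after SPARK-19765: B and C silently lose their cached status too
{code}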
[jira] [Created] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR
Andrew Ash created SPARK-20815: -- Summary: NullPointerException in RPackageUtils#checkManifestForR Key: SPARK-20815 URL: https://issues.apache.org/jira/browse/SPARK-20815 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.1 Reporter: Andrew Ash Some jars don't have manifest files in them; in my case, javax.inject-1.jar and value-2.2.1-annotations.jar. This causes the below NPE: {noformat} Exception in thread "main" java.lang.NullPointerException at org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {noformat} due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null. However, per the JDK spec, it can be null: {noformat} /** * Returns the jar file manifest, or null if none. * * @return the jar file manifest, or null if none * * @throws IllegalStateException * may be thrown if the jar file has been closed * @throws IOException if an I/O error has occurred */ public Manifest getManifest() throws IOException { return getManifestFromReference(); } {noformat} This method should do a null check and return false if the manifest is null (meaning no R code in that jar). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
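A minimal sketch of the proposed guard, in the spirit of the description above; the "Spark-HasRPackage" attribute name used here is an assumption for illustration, not confirmed by this ticket:

{code}
import java.util.jar.JarFile

// Return false when the jar has no manifest at all: per the JDK spec quoted
// above, getManifest may return null, and no manifest means no R code to build.
def checkManifestForR(jar: JarFile): Boolean = {
  val manifest = jar.getManifest
  manifest != null &&
    "true" == manifest.getMainAttributes.getValue("Spark-HasRPackage")
}
{code}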
[jira] [Commented] (SPARK-20811) GBT Classifier failed with mysterious StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-20811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018246#comment-16018246 ] Nan Zhu commented on SPARK-20811: - Thanks, let me try it. > GBT Classifier failed with mysterious StackOverflowError > > > Key: SPARK-20811 > URL: https://issues.apache.org/jira/browse/SPARK-20811 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nan Zhu > > I am running the GBT Classifier over the airline dataset (combining 2005-2008), around 22M examples of training data in total. > The code is simple: > {code:title=Bar.scala|borderStyle=solid} > val gradientBoostedTrees = new GBTClassifier() > gradientBoostedTrees.setMaxBins(1000) > gradientBoostedTrees.setMaxIter(500) > gradientBoostedTrees.setMaxDepth(6) > gradientBoostedTrees.setStepSize(1.0) > transformedTrainingSet.cache().foreach(_ => Unit) > val startTime = System.nanoTime() > val model = gradientBoostedTrees.fit(transformedTrainingSet) > println(s"===training time cost: ${(System.nanoTime() - startTime) / 1000.0 / 1000.0} ms") > val resultDF = model.transform(transformedTestset) > val binaryClassificationEvaluator = new BinaryClassificationEvaluator() > binaryClassificationEvaluator.setRawPredictionCol("prediction").setLabelCol("label") > println(s"=test AUC: ${binaryClassificationEvaluator.evaluate(resultDF)}==") > {code} > My training job always fails with: > {quote} > 17/05/19 13:41:29 WARN TaskSetManager: Lost task 18.0 in stage 3907.0 (TID 137506, 10.0.0.13, executor 3): java.lang.StackOverflowError > at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:3037) > at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3061) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2234) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479) > at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > {quote} > The above pattern repeats many times. > Is this a bug, or am I doing something wrong when using GBTClassifier in ML? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
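The thread does not settle on a cause or fix, but a common mitigation for StackOverflowError in long iterative training is to truncate the growing RDD lineage by checkpointing, which GBTClassifier exposes via setCheckpointInterval. A sketch, assuming a SparkSession named spark and an HDFS path you can write to; the interval of 10 is illustrative:

{code}
import org.apache.spark.ml.classification.GBTClassifier

spark.sparkContext.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")
val gradientBoostedTrees = new GBTClassifier()
  .setMaxBins(1000)
  .setMaxIter(500)
  .setMaxDepth(6)
  .setStepSize(1.0)
  .setCheckpointInterval(10) // checkpoint every 10 iterations to bound lineage depth
{code}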
[jira] [Commented] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR
[ https://issues.apache.org/jira/browse/SPARK-20815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018300#comment-16018300 ] Felix Cheung commented on SPARK-20815: -- Makes sense to me. Would you like to contribute the fix? > NullPointerException in RPackageUtils#checkManifestForR > --- > > Key: SPARK-20815 > URL: https://issues.apache.org/jira/browse/SPARK-20815 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Andrew Ash > > Some jars don't have manifest files in them; in my case, javax.inject-1.jar and value-2.2.1-annotations.jar. > This causes the below NPE: > {noformat} > Exception in thread "main" java.lang.NullPointerException > at org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175) > at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} > due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null. > However, per the JDK spec, it can be null: > {noformat} > /** > * Returns the jar file manifest, or null if none. > * > * @return the jar file manifest, or null if none > * > * @throws IllegalStateException > * may be thrown if the jar file has been closed > * @throws IOException if an I/O error has occurred > */ > public Manifest getManifest() throws IOException { > return getManifestFromReference(); > } > {noformat} > This method should do a null check and return false if the manifest is null (meaning no R code in that jar). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
[ https://issues.apache.org/jira/browse/SPARK-20727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20727: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > Skip SparkR tests when missing Hadoop winutils on CRAN windows machines > --- > > Key: SPARK-20727 > URL: https://issues.apache.org/jira/browse/SPARK-20727 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shivaram Venkataraman > > We should skip tests that use the Hadoop libraries while running the CRAN check with Windows as the operating system. This is to handle cases where the Hadoop winutils binaries are not available on the target system. The skipped tests will consist of: > 1. Tests that save and load a model in MLlib > 2. Tests that save and load CSV, JSON, and Parquet files in SQL > 3. Hive tests > Note that these tests will still be run on AppVeyor for every PR, so our overall test coverage should not go down. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
[ https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018313#comment-16018313 ] Felix Cheung commented on SPARK-18825: -- I see, about dontrun: yes, I don't think we can remove dontrun from the examples, because they would take too long for the CRAN check (we are already trimming a lot and will likely need to trim more to make it work), but if we had a way to run the examples during an explicit doc-generation step, that could be useful. > Eliminate duplicate links in SparkR API doc index > - > > Key: SPARK-18825 > URL: https://issues.apache.org/jira/browse/SPARK-18825 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > The SparkR API docs contain many duplicate links with suffixes {{-method}} or > {{-class}} in the index. E.g., {{atan}} and {{atan-method}} link to the same > doc. > Copying from [~felixcheung] in [SPARK-18332]: > {quote} > They are because of the > {{@ aliases}} > tags. I think we are adding them because CRAN checks require them to match > the specific format - [~shivaram] would you know? > I am pretty sure they are double-listed because in addition to aliases we > also have > {{@ rdname}} > which automatically generates the links as well. > I suspect if we change all the rdname to match the string in aliases then > there will be one link. I can take a shot at this to test this out, but > changes will be very extensive - is this something we could get into 2.1 > still? > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20805) updated updateP in SVD++ is wrong
[ https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315 ] BoLing commented on SPARK-20805: Hi Sean Owen, you can see this URL: https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus > updated updateP in SVD++ is wrong > -- > > Key: SPARK-20805 > URL: https://issues.apache.org/jira/browse/SPARK-20805 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1, 2.1.1 >Reporter: BoLing > > In the SVD++ algorithm, we all know that usr._2 stores the value of pu + |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values updateP, updateQ and updateY. During the iteration cycle, the y part of usr._2 is updated, but pu never is, so we should fix the send-to-source message in sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), it should achieve the effect we want. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
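Restating the one-slot change from the description side by side (both lines are copied from the report; ctx, usr, err, and conf come from the surrounding SVD++ code in GraphX):

{code}
// As reported: the second tuple element sends updateY, so the pu part of usr._2 is never refreshed.
ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1))
// Proposed in the report: send updateP in that slot instead.
ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1))
{code}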
[jira] [Comment Edited] (SPARK-20805) updated updateP in SVD++ is wrong
[ https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315 ] BoLing edited comment on SPARK-20805 at 5/20/17 4:31 AM: - Hi @Sean Owen, you can see this URL: https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus was (Author: boling): Hi Sean Owen, you can see this URL: https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus > updated updateP in SVD++ is wrong > -- > > Key: SPARK-20805 > URL: https://issues.apache.org/jira/browse/SPARK-20805 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1, 2.1.1 >Reporter: BoLing > > In the SVD++ algorithm, we all know that usr._2 stores the value of pu + |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values updateP, updateQ and updateY. During the iteration cycle, the y part of usr._2 is updated, but pu never is, so we should fix the send-to-source message in sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), it should achieve the effect we want. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
[ https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018314#comment-16018314 ] Felix Cheung commented on SPARK-18825: -- Handling a fork of knitr might be too hard to maintain, given that we don't have direct access to the Jenkins boxes. > Eliminate duplicate links in SparkR API doc index > - > > Key: SPARK-18825 > URL: https://issues.apache.org/jira/browse/SPARK-18825 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > The SparkR API docs contain many duplicate links with suffixes {{-method}} or > {{-class}} in the index. E.g., {{atan}} and {{atan-method}} link to the same > doc. > Copying from [~felixcheung] in [SPARK-18332]: > {quote} > They are because of the > {{@ aliases}} > tags. I think we are adding them because CRAN checks require them to match > the specific format - [~shivaram] would you know? > I am pretty sure they are double-listed because in addition to aliases we > also have > {{@ rdname}} > which automatically generates the links as well. > I suspect if we change all the rdname to match the string in aliases then > there will be one link. I can take a shot at this to test this out, but > changes will be very extensive - is this something we could get into 2.1 > still? > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20805) updated updateP in SVD++ is wrong
[ https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018315#comment-16018315 ] BoLing edited comment on SPARK-20805 at 5/20/17 4:32 AM: - Hi Sean Owen, you can see this URL: https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus was (Author: boling): Hi @Sean Owen, you can see this URL: https://github.com/NewBoLing/GraphSVD-/blob/master/SVDPlusPlus > updated updateP in SVD++ is wrong > -- > > Key: SPARK-20805 > URL: https://issues.apache.org/jira/browse/SPARK-20805 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1, 2.1.1 >Reporter: BoLing > > In the SVD++ algorithm, we all know that usr._2 stores the value of pu + |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values updateP, updateQ and updateY. During the iteration cycle, the y part of usr._2 is updated, but pu never is, so we should fix the send-to-source message in sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), it should achieve the effect we want. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20751) Built-in SQL Function Support - COT
[ https://issues.apache.org/jira/browse/SPARK-20751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018321#comment-16018321 ] Apache Spark commented on SPARK-20751: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18039 > Built-in SQL Function Support - COT > --- > > Key: SPARK-20751 > URL: https://issues.apache.org/jira/browse/SPARK-20751 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Yuming Wang > Fix For: 2.3.0 > > > {noformat} > COT(n) > {noformat} > Returns the cotangent of n. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
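For context, a usage sketch assuming an active SparkSession named spark on a build that includes the function (2.3.0 per the Fix Version above), with the standard definition cot(x) = cos(x)/sin(x):

{code}
spark.sql("SELECT COT(1.0)").show() // cot(1) = cos(1)/sin(1) ≈ 0.6421
{code}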
[jira] [Commented] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR
[ https://issues.apache.org/jira/browse/SPARK-20815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018326#comment-16018326 ] James Shuster commented on SPARK-20815: --- I have a fix in the works; just adding a test case and running the full test suite now. > NullPointerException in RPackageUtils#checkManifestForR > --- > > Key: SPARK-20815 > URL: https://issues.apache.org/jira/browse/SPARK-20815 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Andrew Ash > > Some jars don't have manifest files in them; in my case, javax.inject-1.jar and value-2.2.1-annotations.jar. > This causes the below NPE: > {noformat} > Exception in thread "main" java.lang.NullPointerException > at org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202) > at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175) > at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} > due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null. > However, per the JDK spec, it can be null: > {noformat} > /** > * Returns the jar file manifest, or null if none. > * > * @return the jar file manifest, or null if none > * > * @throws IllegalStateException > * may be thrown if the jar file has been closed > * @throws IOException if an I/O error has occurred > */ > public Manifest getManifest() throws IOException { > return getManifestFromReference(); > } > {noformat} > This method should do a null check and return false if the manifest is null (meaning no R code in that jar). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org