[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189651#comment-14189651 ] Zhan Zhang commented on SPARK-1537: --- Hi Marcelo, Do you have an update on this? If you don't mind, I can work on your branch to get this done ASAP. Please let me know what you think. > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4149) ISO 8601 support for json date time strings
[ https://issues.apache.org/jira/browse/SPARK-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189618#comment-14189618 ] Apache Spark commented on SPARK-4149: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3012 > ISO 8601 support for json date time strings > --- > > Key: SPARK-4149 > URL: https://issues.apache.org/jira/browse/SPARK-4149 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > > Parse JSON date-time strings like "2014-10-29T20:05:00-08:00" or > "2014-10-29T20:05:00Z". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
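The two formats named in the issue can be handled with the standard library alone. A minimal sketch of one way to parse both the numeric-offset and `Z` variants (illustrative only, not the code from the linked pull request; fractional seconds are not handled):

```python
from datetime import datetime, timedelta, timezone

def parse_iso8601(s):
    """Parse e.g. '2014-10-29T20:05:00-08:00' or '2014-10-29T20:05:00Z'."""
    # Normalize a trailing 'Z' (UTC) to an explicit +00:00 offset.
    if s.endswith("Z"):
        s = s[:-1] + "+00:00"
    # Split into the timestamp body and the "+HH:MM" / "-HH:MM" suffix.
    body, sign, off = s[:-6], s[-6], s[-5:]
    dt = datetime.strptime(body, "%Y-%m-%dT%H:%M:%S")
    delta = timedelta(hours=int(off[:2]), minutes=int(off[3:]))
    if sign == "-":
        delta = -delta
    return dt.replace(tzinfo=timezone(delta))
```

Both example strings from the issue then round-trip to timezone-aware datetimes with the expected UTC offsets.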
[jira] [Commented] (SPARK-4150) rdd.setName returns None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189592#comment-14189592 ] Apache Spark commented on SPARK-4150: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3011 > rdd.setName returns None in PySpark > --- > > Key: SPARK-4150 > URL: https://issues.apache.org/jira/browse/SPARK-4150 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Trivial > > We should return self so we can do > {code} > rdd.setName('abc').cache().count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4150) rdd.setName returns None in PySpark
Xiangrui Meng created SPARK-4150: Summary: rdd.setName returns None in PySpark Key: SPARK-4150 URL: https://issues.apache.org/jira/browse/SPARK-4150 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial We should return self so we can do {code} rdd.setName('abc').cache().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
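The requested change is the standard fluent-setter pattern: have the setter return `self` instead of the implicit `None`. A toy illustration (hypothetical class, not the PySpark RDD):

```python
class ToyRDD:
    """Toy stand-in for an RDD, just to show the fluent-setter fix."""
    def __init__(self):
        self.name = None
        self.cached = False

    def setName(self, name):
        self.name = name
        return self  # returning self (instead of None) enables chaining

    def cache(self):
        self.cached = True
        return self

    def count(self):
        return 0
```

With this, `ToyRDD().setName('abc').cache().count()` works as the issue's example expects, where a `None` return would raise an `AttributeError` at `.cache()`.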
[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4148: - Description: The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1. {code} In [14]: import random In [15]: r1 = random.Random(10) In [16]: r1.randint(0, 1) Out[16]: 1 In [17]: r1.random() Out[17]: 0.4288890546751146 In [18]: r1.random() Out[18]: 0.5780913011344704 In [19]: r2 = random.Random(10) In [20]: r2.randint(0, 1) Out[20]: 1 In [21]: r2.randint(0, 1) Out[21]: 0 In [22]: r2.random() Out[22]: 0.5780913011344704 {code} So the second value from partition 1 is the same as the first value from partition 2. was:We should have different seeds. Otherwise, we get the same sequence from each partition. > PySpark's sample uses the same seed for all partitions > -- > > Key: SPARK-4148 > URL: https://issues.apache.org/jira/browse/SPARK-4148 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > The current way of seed distribution makes the random sequences from > partition i and i+1 offset by 1. > {code} > In [14]: import random > In [15]: r1 = random.Random(10) > In [16]: r1.randint(0, 1) > Out[16]: 1 > In [17]: r1.random() > Out[17]: 0.4288890546751146 > In [18]: r1.random() > Out[18]: 0.5780913011344704 > In [19]: r2 = random.Random(10) > In [20]: r2.randint(0, 1) > Out[20]: 1 > In [21]: r2.randint(0, 1) > Out[21]: 0 > In [22]: r2.random() > Out[22]: 0.5780913011344704 > {code} > So the second value from partition 1 is the same as the first value from > partition 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
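One way to avoid the offset-by-one correlation demonstrated above is to derive a distinct seed per partition rather than consuming one value from a shared-seed generator. A sketch of the idea (illustrative, not necessarily the approach taken in the pull request):

```python
import random

def per_partition_rng(base_seed, split):
    # Mix the partition index into the seed so partitions i and i+1 get
    # unrelated streams rather than shifted copies of the same stream.
    derived = hash((base_seed, split))
    return random.Random(derived)
```

The same `(base_seed, split)` pair stays reproducible, while different splits start from different seeds instead of the same one.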
[jira] [Commented] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189579#comment-14189579 ] Apache Spark commented on SPARK-4148: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3010 > PySpark's sample uses the same seed for all partitions > -- > > Key: SPARK-4148 > URL: https://issues.apache.org/jira/browse/SPARK-4148 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should have different seeds. Otherwise, we get the same sequence from each > partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4149) ISO 8601 support for json date time strings
Adrian Wang created SPARK-4149: -- Summary: ISO 8601 support for json date time strings Key: SPARK-4149 URL: https://issues.apache.org/jira/browse/SPARK-4149 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Priority: Minor parse json string like "2014-10-29T20:05:00-08:00" or "2014-10-29T20:05:00Z". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4148: - Affects Version/s: (was: 1.0.0) 1.0.2 > PySpark's sample uses the same seed for all partitions > -- > > Key: SPARK-4148 > URL: https://issues.apache.org/jira/browse/SPARK-4148 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should have different seeds. Otherwise, we get the same sequence from each > partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4148) PySpark's sample uses the same seed for all partitions
Xiangrui Meng created SPARK-4148: Summary: PySpark's sample uses the same seed for all partitions Key: SPARK-4148 URL: https://issues.apache.org/jira/browse/SPARK-4148 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0, 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should have different seeds. Otherwise, we get the same sequence from each partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4147) Remove log4j dependency
Tobias Pfeiffer created SPARK-4147: -- Summary: Remove log4j dependency Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude the slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if LogManager were accessed only after log4j use had been detected. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
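The suggested guard, probing for the optional backend before touching any of its classes instead of referencing it unconditionally, corresponds to a `Class.forName` check in Scala. A Python analogue of the same pattern (illustrative only):

```python
import importlib.util

def backend_available(module_name):
    # Probe for an optional backend before referencing any of its symbols,
    # instead of depending on it unconditionally.
    return importlib.util.find_spec(module_name) is not None
```

Backend-specific initialization or shutdown code then runs only when the probe succeeds, so applications that ship a different backend never trigger the hard dependency.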
[jira] [Updated] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark
[ https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated SPARK-4146: - Fix Version/s: 1.2.0 > [GraphX] Modify option name according to example doc in SynthBenchmark > --- > > Key: SPARK-4146 > URL: https://issues.apache.org/jira/browse/SPARK-4146 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0, 1.1.1 >Reporter: Jie Huang >Priority: Minor > Fix For: 1.2.0 > > > Now graphx.SynthBenchmark example has an option of iteration number named as > "niter". However, in its document, it is named as "niters". The mismatch > between the implementation and document causes certain > IllegalArgumentException while trying that example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4144: - Assignee: Liquan Pei > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: MLlib, Streaming >Reporter: Chris Fregly >Assignee: Liquan Pei > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark
[ https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated SPARK-4146: - Affects Version/s: 1.1.1 > [GraphX] Modify option name according to example doc in SynthBenchmark > --- > > Key: SPARK-4146 > URL: https://issues.apache.org/jira/browse/SPARK-4146 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0, 1.1.1 >Reporter: Jie Huang >Priority: Minor > Fix For: 1.2.0 > > > Now graphx.SynthBenchmark example has an option of iteration number named as > "niter". However, in its document, it is named as "niters". The mismatch > between the implementation and document causes certain > IllegalArgumentException while trying that example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark
[ https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang resolved SPARK-4146. -- Resolution: Fixed > [GraphX] Modify option name according to example doc in SynthBenchmark > --- > > Key: SPARK-4146 > URL: https://issues.apache.org/jira/browse/SPARK-4146 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0, 1.1.1 >Reporter: Jie Huang >Priority: Minor > > Now graphx.SynthBenchmark example has an option of iteration number named as > "niter". However, in its document, it is named as "niters". The mismatch > between the implementation and document causes certain > IllegalArgumentException while trying that example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark
Jie Huang created SPARK-4146: Summary: [GraphX] Modify option name according to example doc in SynthBenchmark Key: SPARK-4146 URL: https://issues.apache.org/jira/browse/SPARK-4146 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jie Huang Priority: Minor The graphx.SynthBenchmark example has an iteration-count option named "niter". However, its documentation calls it "niters". The mismatch between the implementation and the documentation causes an IllegalArgumentException when running the example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
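The failure mode is what any strict option parser produces when the documented name and the accepted name diverge. A minimal sketch (the option names come from the issue; the parser itself is hypothetical):

```python
def parse_options(args, known=("niter",)):
    # The example's code accepts "niter", so the documented "--niters=5"
    # is rejected -- the mismatch the issue describes.
    opts = {}
    for arg in args:
        key, _, value = arg.lstrip("-").partition("=")
        if key not in known:
            raise ValueError("Invalid argument: " + key)
        opts[key] = value
    return opts
```

Passing `--niter=5` succeeds, while the documented `--niters=5` raises, which is why fixing either the code or the doc resolves the report.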
[jira] [Updated] (SPARK-4078) New FsPermission instance w/o FsPermission.createImmutable in eventlog
[ https://issues.apache.org/jira/browse/SPARK-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dai updated SPARK-4078: - Assignee: Jason Dai > New FsPermission instance w/o FsPermission.createImmutable in eventlog > -- > > Key: SPARK-4078 > URL: https://issues.apache.org/jira/browse/SPARK-4078 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Jie Huang >Assignee: Jason Dai > > By default, Spark builds its package against Hadoop 1.0.4 version. In that > version, it has some FsPermission bug (see HADOOP-7629 by Todd Lipcon). This > bug got fixed since 1.1 version. By using that FsPermission.createImmutable() > API, end-user may see some RPC exception like below (if turn on eventlog over > HDFS). > {quote} > Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 > failed on local exception: java.io.EOFException > at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150) > at org.apache.hadoop.ipc.Client.call(Client.java:1118) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) > at $Proxy6.setPermission(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62) > at $Proxy6.setPermission(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572) > at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138) > at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) > at > 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) > at org.apache.spark.SparkContext.(SparkContext.scala:324) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dai updated SPARK-4094: - Assignee: Zhang, Liye > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > > rdd.checkpoint() must be called before any action on the RDD; if any action > runs first, the checkpoint never succeeds. Take the following code as an > example: > *rdd = sc.makeRDD(...)* > *rdd.collect()* > *rdd.checkpoint()* > *rdd.count()* > This RDD will never be checkpointed. This does not happen with RDD caching: > caching always takes effect before subsequent actions, no matter how many > actions ran before cache(). > So rdd.checkpoint() should behave the same way as rdd.cache(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
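The cache()-like semantics being asked for can be sketched as a flag that any later action honors, regardless of how many actions ran before the mark (toy model, not the Spark implementation):

```python
class ToyDataset:
    # Toy model of the requested behavior: checkpoint() only marks the
    # dataset; the next action materializes the checkpoint even if other
    # actions already ran earlier.
    def __init__(self, data):
        self.data = list(data)
        self.checkpoint_requested = False
        self.checkpointed = False

    def checkpoint(self):
        self.checkpoint_requested = True

    def collect(self):
        if self.checkpoint_requested and not self.checkpointed:
            self.checkpointed = True  # stand-in for writing to reliable storage
        return self.data
```

In this model the collect/checkpoint/count sequence from the issue ends with the dataset checkpointed, because the mark is honored by whichever action comes next.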
[jira] [Updated] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dai updated SPARK-2926: - Assignee: Saisai Shao > Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle > -- > > Key: SPARK-2926 > URL: https://issues.apache.org/jira/browse/SPARK-2926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test > Report(contd).pdf, Spark Shuffle Test Report.pdf > > > Spark has already integrated sort-based shuffle write, which greatly > improves the IO performance and reduces the memory consumption when the > reducer count is very large. But the reducer side still adopts the > hash-based shuffle reader implementation, which neglects the ordering of > map output data in some situations. > Here we propose an MR-style sort-merge shuffle reader for sort-based > shuffle to further improve its performance. > Work-in-progress code and a performance test report will be posted once > some unit test bugs are fixed. > Any comments would be greatly appreciated. > Thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
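The read-side idea, combining already-sorted map outputs with a k-way merge instead of rebuilding them in a hash map, is the classic merge step. A sketch using the Python standard library (illustrative of the technique, not the proposed Scala code):

```python
import heapq

def merge_sorted_map_outputs(streams):
    # Each stream is an iterable of (key, value) pairs already sorted by
    # key on the map side; heapq.merge streams a global key order without
    # materializing all records in memory.
    return heapq.merge(*streams, key=lambda kv: kv[0])
```

Because the merge is streaming, the reducer holds only one record per input stream at a time, which is where the memory advantage over a hash-based reader comes from.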
[jira] [Commented] (SPARK-4145) Create jobs overview and job details pages on the web UI
[ https://issues.apache.org/jira/browse/SPARK-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189429#comment-14189429 ] Apache Spark commented on SPARK-4145: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3009 > Create jobs overview and job details pages on the web UI > > > Key: SPARK-4145 > URL: https://issues.apache.org/jira/browse/SPARK-4145 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should create a jobs overview and job details page on the web UI. The > overview page would list all jobs in the SparkContext and would replace the > current "Stages" page as the default web UI page. The job details page would > provide information on the stages triggered by a particular job; it would > also serve as a place to show DAG visualizations and other debugging aids. > I still plan to keep the current "Stages" page, which lists all stages of all > jobs, since it's a useful debugging aid for figuring out how resources are > being consumed across all jobs in a Spark Cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4145) Create jobs overview and job details pages on the web UI
Josh Rosen created SPARK-4145: - Summary: Create jobs overview and job details pages on the web UI Key: SPARK-4145 URL: https://issues.apache.org/jira/browse/SPARK-4145 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Josh Rosen Assignee: Josh Rosen We should create a jobs overview and job details page on the web UI. The overview page would list all jobs in the SparkContext and would replace the current "Stages" page as the default web UI page. The job details page would provide information on the stages triggered by a particular job; it would also serve as a place to show DAG visualizations and other debugging aids. I still plan to keep the current "Stages" page, which lists all stages of all jobs, since it's a useful debugging aid for figuring out how resources are being consumed across all jobs in a Spark Cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4053) Block generator throttling in NetworkReceiverSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4053. Resolution: Fixed Fix Version/s: 1.2.0 > Block generator throttling in NetworkReceiverSuite is flaky > --- > > Key: SPARK-4053 > URL: https://issues.apache.org/jira/browse/SPARK-4053 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > Fix For: 1.2.0 > > > In the unit test that checked whether blocks generated by throttled block > generator had expected number of records, the thresholds are too tight, which > sometimes led to the test failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
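The fix direction, replacing exact expected record counts with a tolerance band, can be sketched like this (the slack value here is illustrative; the actual thresholds live in the test):

```python
def within_throttle_band(actual_records, rate, interval_sec, slack=0.5):
    # A throttled generator won't hit the nominal rate exactly, so assert
    # a band around the expectation instead of an exact count.
    expected = rate * interval_sec
    return (1 - slack) * expected <= actual_records <= (1 + slack) * expected
```

A check of this shape tolerates normal scheduling jitter while still catching a throttle that is wildly off.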
[jira] [Closed] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3795. Resolution: Fixed Fix Version/s: 1.2.0 > Add scheduler hooks/heuristics for adding and removing executors > > > Key: SPARK-3795 > URL: https://issues.apache.org/jira/browse/SPARK-3795 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Patrick Wendell >Assignee: Andrew Or > Fix For: 1.2.0 > > > To support dynamic scaling of a Spark application, Spark's scheduler will > need to have hooks around explicitly decommissioning executors. We'll also > need basic heuristics governing when to start/stop executors based on load. > An initial goal is to keep this very simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
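A load heuristic of the simple kind described might look like the following sketch (the policy and its parameters are illustrative; the issue deliberately leaves the actual heuristic open):

```python
import math

def desired_executors(pending_tasks, tasks_per_executor, lo, hi):
    # Size the executor pool to the task backlog, clamped to configured
    # lower and upper bounds.
    want = int(math.ceil(pending_tasks / float(tasks_per_executor)))
    return max(lo, min(hi, want))
```

The scheduler hooks would then request or decommission executors to move the pool toward this target after each scheduling round.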
[jira] [Created] (SPARK-4144) Support incremental model training of Naive Bayes classifier
Chris Fregly created SPARK-4144: --- Summary: Support incremental model training of Naive Bayes classifier Key: SPARK-4144 URL: https://issues.apache.org/jira/browse/SPARK-4144 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Chris Fregly Per Xiangrui Meng from the following user list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E "For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the updates." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
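The quoted requirement, remembering observation counts so priors and conditional probabilities can be updated as new data arrives, can be sketched as follows (illustrative, not the MLlib design):

```python
from collections import defaultdict

class IncrementalCounts:
    # Keep raw counts; priors and per-feature conditionals can be
    # recomputed from them after every batch, which is what makes
    # incremental updates possible.
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_sums = defaultdict(lambda: defaultdict(float))
        self.total = 0

    def update(self, batch):
        # batch: iterable of (label, feature_vector)
        for label, features in batch:
            self.class_counts[label] += 1
            self.total += 1
            for j, v in enumerate(features):
                self.feature_sums[label][j] += v

    def prior(self, label):
        return self.class_counts[label] / float(self.total)
```

Because only sums and counts are stored, merging a new mini-batch (e.g. from a streaming source) is a constant-size state update rather than a full retrain.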
[jira] [Commented] (SPARK-4143) Move inner class DeferredObjectAdapter to top level
[ https://issues.apache.org/jira/browse/SPARK-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189380#comment-14189380 ] Apache Spark commented on SPARK-4143: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3007 > Move inner class DeferredObjectAdapter to top level > --- > > Key: SPARK-4143 > URL: https://issues.apache.org/jira/browse/SPARK-4143 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Priority: Trivial > > The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which > may cause some overhead in closure ser/de-ser. Move it to top level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4143) Move inner class DeferredObjectAdapter to top level
Cheng Hao created SPARK-4143: Summary: Move inner class DeferredObjectAdapter to top level Key: SPARK-4143 URL: https://issues.apache.org/jira/browse/SPARK-4143 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Trivial The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
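The overhead described has a direct Python analogue: a function defined inside an object closes over that object, so serializing the function must drag the enclosing state along (or fails outright), while a top-level definition serializes by reference. A sketch of the pattern (hypothetical names):

```python
import pickle

class Holder:
    def __init__(self):
        self.big_state = list(range(100000))  # dead weight for the closure

    def make_adapter(self):
        def adapter(x):
            # Closes over `self`, so serializing it would have to carry the
            # whole Holder (plain pickle refuses local functions entirely).
            return x + len(self.big_state)
        return adapter

def top_level_adapter(x, n):
    # Defined at top level: picklable by reference, no captured state.
    return x + n
```

Moving the class to top level in the Scala code removes the implicit reference to the enclosing `HiveGenericUdf`, for the same reason the top-level function above serializes cleanly.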
[jira] [Commented] (SPARK-4132) Spark uses incompatible HDFS API
[ https://issues.apache.org/jira/browse/SPARK-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189351#comment-14189351 ] kuromatsu nobuyuki commented on SPARK-4132: --- Owen, thank you for pointing this out. It looks much the same as my problem. > Spark uses incompatible HDFS API > > > Key: SPARK-4132 > URL: https://issues.apache.org/jira/browse/SPARK-4132 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark1.1.0 on Hadoop1.2.1 > CentOS 6.3 64bit >Reporter: kuromatsu nobuyuki >Priority: Minor > > When I enable event logging and set it to output to HDFS, initialization > fails with 'java.lang.ClassNotFoundException' (see trace below). > I found that an API incompatibility in > org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop > 1.1.0 (and above) causes this error > (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't > exist in my 1.2.1 environment). > I think the Spark jar pre-built for Hadoop 1.x should be built against the > stable Hadoop version (Hadoop 1.2.1). > 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server > listener on 9000: > readAndProcess threw exception java.lang.RuntimeException: > readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2.
> Count of bytes read: 0 > java.lang.RuntimeException: readObject can't find class > org.apache.hadoop.fs.permission.FsPermission$2 > at > org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233) > at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106) > at > org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347) > at > org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326) > at > org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226) > at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577) > at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:701) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.fs.permission.FsPermission$2 > at java.net.URLClassLoader$1.run(URLClassLoader.java:217) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:205) > at java.lang.ClassLoader.loadClass(ClassLoader.java:323) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) > at java.lang.ClassLoader.loadClass(ClassLoader.java:268) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) > at > org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231) > ... 9 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4142) Bad Default for GraphLoader Edge Partitions
[ https://issues.apache.org/jira/browse/SPARK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189347#comment-14189347 ] Apache Spark commented on SPARK-4142: - User 'jegonzal' has created a pull request for this issue: https://github.com/apache/spark/pull/3006 > Bad Default for GraphLoader Edge Partitions > --- > > Key: SPARK-4142 > URL: https://issues.apache.org/jira/browse/SPARK-4142 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Joseph E. Gonzalez > > The default number of edge partitions for the GraphLoader is set to 1 rather > than the default parallelism. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2672) Support compression in wholeFile()
[ https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189320#comment-14189320 ] Apache Spark commented on SPARK-2672: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3005 > Support compression in wholeFile() > -- > > Key: SPARK-2672 > URL: https://issues.apache.org/jira/browse/SPARK-2672 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Davies Liu >Assignee: Davies Liu > Original Estimate: 72h > Remaining Estimate: 72h > > wholeFile() cannot read compressed files; it should be able to, just like > textFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
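The requested behavior, transparently decoding compressed input the way textFile() does, typically keys off the file suffix. A minimal Python sketch (illustrative, gzip only):

```python
import gzip
import io

def open_maybe_compressed(path):
    # Transparently decompress .gz input; fall back to plain text.
    if path.endswith(".gz"):
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Callers iterate the returned handle the same way regardless of whether the underlying file was compressed, which is the user-facing property the issue asks for.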
[jira] [Created] (SPARK-4142) Bad Default for GraphLoader Edge Partitions
Joseph E. Gonzalez created SPARK-4142: - Summary: Bad Default for GraphLoader Edge Partitions Key: SPARK-4142 URL: https://issues.apache.org/jira/browse/SPARK-4142 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez The default number of edge partitions for the GraphLoader is set to 1 rather than the default parallelism. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
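The fix described above amounts to a fallback rule: honor an explicitly requested partition count, otherwise use the scheduler's default parallelism instead of the hard-coded 1. A minimal sketch of that rule (function and parameter names are hypothetical, not GraphX's actual GraphLoader signature):

```python
def edge_partitions(requested=None, default_parallelism=1):
    """Pick the number of edge partitions: honor an explicit positive
    request, otherwise fall back to the scheduler's default parallelism
    rather than a constant 1."""
    if requested is not None and requested > 0:
        return requested
    return max(default_parallelism, 1)
```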
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189093#comment-14189093 ] Xiangrui Meng commented on SPARK-3080: -- Btw, the `ArrayIndexOutOfBoundsException` is from the driver log. Could you also check the executor logs? It may contain the root cause. > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4097) Race condition in org.apache.spark.ComplexFutureAction.cancel
[ https://issues.apache.org/jira/browse/SPARK-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4097. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Shixiong Zhu > Race condition in org.apache.spark.ComplexFutureAction.cancel > - > > Key: SPARK-4097 > URL: https://issues.apache.org/jira/browse/SPARK-4097 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Labels: bug, race-condition > Fix For: 1.1.1, 1.2.0 > > > There is a chance that `thread` is null when calling `thread.interrupt()`. > {code:java} > override def cancel(): Unit = this.synchronized { > _cancelled = true > if (thread != null) { > thread.interrupt() > } > } > {code} > Should put `thread = null` into a `synchronized` block to fix the race > condition. > {code:java} > try { > p.success(func) > } catch { > case e: Exception => p.failure(e) > } finally { > thread = null > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
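The fix quoted above can be illustrated outside the JVM: the race disappears when cancel() and the worker's cleanup both touch the thread reference only while holding the same lock. A Python sketch of that locking discipline (Python threads cannot be interrupted, so the interrupt itself is a stub; the point is the null check and the locked hand-off, mirroring the `synchronized` blocks in the Scala fix):

```python
import threading


class CancellableAction:
    """Sketch of the SPARK-4097 fix: cancel() and the worker's cleanup
    both access self._thread only under the same lock, so cancel() can
    never race with the reference being cleared."""

    def __init__(self, func):
        self._lock = threading.Lock()
        self._cancelled = False
        self._thread = None
        self._func = func
        self.result = None

    def run(self):
        with self._lock:
            if self._cancelled:
                return
            t = threading.Thread(target=self._body)
            self._thread = t
        t.start()
        t.join()

    def _body(self):
        try:
            self.result = self._func()
        finally:
            with self._lock:
                self._thread = None  # the `thread = null` from the fix

    def cancel(self):
        with self._lock:
            self._cancelled = True
            if self._thread is not None:
                # JVM code would call thread.interrupt() here; Python
                # threads cannot be interrupted, so this is a no-op stub.
                pass
```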
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189039#comment-14189039 ] Ilya Ganelin commented on SPARK-3080: - Hi all - I have managed to make some substantial progress! What I discovered is that the default parallelization setting is critical. I did two things that got me around this blocker: 1) I increased the amount of memory available to nodes - by itself this did not solve the problem 2) I set .set("spark.default.parallelism","300") I believe the latter is critical because even if I partitioned the data before feeding it into ALS.train, the internal operations would produce RDDs that are coalesced into fewer partitions. Consequently, I believe these smaller (but presumably large in memory) partitions would create memory issues ultimately leading to this and other hard to pin-down issues. Forcing default parallelism ensured that even these internal operations would shard appropriately. > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > 
org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
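A rough way to see why forcing the parallelism helps, per the comment above: the per-partition record count under an even split shrinks with the partition count, so internally coalesced RDDs with few partitions carry enormous partitions. Back-of-envelope arithmetic with the ~12B-rating figure from the report (partition counts illustrative):

```python
def records_per_partition(total_records, num_partitions):
    """Ceiling of total_records / num_partitions: the most records any
    one partition holds under an even split."""
    return -(-total_records // num_partitions)


ratings = 12_000_000_000                           # ~12B ratings from the report
coalesced = records_per_partition(ratings, 55)     # e.g. one partition per instance
forced = records_per_partition(ratings, 300)       # spark.default.parallelism=300
# coalesced == 218_181_819 records per partition; forced == 40_000_000
```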
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189014#comment-14189014 ] Josh Rosen commented on SPARK-4133: --- Also, could you enable debug logging and share the executor logs? If you're able to reliably reproduce this bug, please email me at joshro...@databricks.com and I'd be glad to hop on Skype to help you configure logging, etc. > PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 > -- > > Key: SPARK-4133 > URL: https://issues.apache.org/jira/browse/SPARK-4133 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Antonio Jesus Navarro >Priority: Blocker > > Snappy-related problems found when trying to upgrade existing Spark Streaming > App from 1.0.2 to 1.1.0. > We cannot run an existing 1.0.2 Spark app after upgrading to 1.1.0. > > IOException is thrown by snappy (parsing_error(2)) > > Only the Spark version changed > As far as we have checked, snappy will throw this error when dealing with > zero-byte-length arrays. > We have tried: > > Changing from snappy to LZF, > > Changing broadcast.compression false > > Changing from TorrentBroadcast to HTTPBroadcast. > but with no luck for the moment. 
> {code} > [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0] > org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage > 0.0 (TID 0) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232) > at > org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
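If zero-byte blocks are indeed the trigger, as the reporter suspects above, one defensive pattern is to short-circuit empty buffers before they ever reach the codec. A sketch of that guard (zlib stands in for snappy so the example is runnable without the native library; this illustrates the idea, not Spark's eventual fix):

```python
import zlib


def safe_uncompress(buf, uncompress=zlib.decompress):
    """Return the decompressed payload, treating a zero-byte input as an
    empty payload instead of handing it to the codec, which is what
    reportedly produces snappy's PARSING_ERROR(2)."""
    if not buf:
        return b""
    return uncompress(buf)
```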
[jira] [Created] (SPARK-4141) Hide Accumulators column on stage page when no accumulators exist
Kay Ousterhout created SPARK-4141: - Summary: Hide Accumulators column on stage page when no accumulators exist Key: SPARK-4141 URL: https://issues.apache.org/jira/browse/SPARK-4141 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Priority: Minor The task table on the details page for each stage has a column for accumulators. We should only show this column if the stage has accumulators, otherwise it clutters the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188998#comment-14188998 ] Apache Spark commented on SPARK-3466: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3003 > Limit size of results that a driver collects for each action > > > Key: SPARK-3466 > URL: https://issues.apache.org/jira/browse/SPARK-3466 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Assignee: Davies Liu >Priority: Critical > > Right now, operations like {{collect()}} and {{take()}} can crash the driver > with an OOM if they bring back too much data. We should add a > {{spark.driver.maxResultSize}} setting (or something like that) that will > make the driver abort a job if its result is too big. We can set it to some > fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
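The proposed setting boils down to the driver keeping a running total of serialized result sizes and failing the job once the total crosses a cap. A minimal sketch of that bookkeeping (class name and error message are illustrative, not Spark's implementation):

```python
class ResultSizeGuard:
    """Track the total serialized size of task results and abort once it
    exceeds a configured cap, as spark.driver.maxResultSize was proposed
    to do for collect()/take()."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0

    def add(self, serialized_result):
        self.total += len(serialized_result)
        if self.total > self.max_bytes:
            raise RuntimeError(
                "total result size %d bytes exceeds limit %d"
                % (self.total, self.max_bytes))
        return serialized_result
```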
[jira] [Closed] (SPARK-4126) Do not set `spark.executor.instances` if not needed (yarn-cluster)
[ https://issues.apache.org/jira/browse/SPARK-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4126. Resolution: Won't Fix superseded by SPARK-4138 > Do not set `spark.executor.instances` if not needed (yarn-cluster) > -- > > Key: SPARK-4126 > URL: https://issues.apache.org/jira/browse/SPARK-4126 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > In yarn cluster mode, we currently always set `spark.executor.instances` > regardless of whether this is set by the user. While not a huge deal, this > prevents us from knowing whether the user did specify a starting number of > executors. > This is needed in SPARK-3795 to throw the appropriate exception when this is > set AND dynamic executor allocation is turned on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers
[ https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3822. Resolution: Fixed Fix Version/s: 1.2.0 > Expose a mechanism for SparkContext to ask for / remove Yarn containers > --- > > Key: SPARK-3822 > URL: https://issues.apache.org/jira/browse/SPARK-3822 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.2.0 > > > This is one of the core components for the umbrella issue SPARK-3174. > Currently, the only agent in Spark that communicates directly with the RM is > the AM. This means the only way for the SparkContext to ask for / remove > containers from the RM is through the AM. The communication link between the > SparkContext and the AM needs to be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188980#comment-14188980 ] Joseph K. Bradley commented on SPARK-3573: -- [~sparks] Trying to simplify things, am I right that the main question is: _Should ML data instances/examples/rows be flat vectors or have structure?_ Breaking this down, (1) Should we allow structure? (2) Should we encourage flatness or structure, and how? (3) How does a Dataset used in a full ML pipeline resemble/differ from a Dataset used by a specific ML algorithm? My thoughts: (1) We should allow structure. For general (complicated) pipelines, it will be important to provide structure to make it easy to select groups of features. (2) We should encourage flatness where possible; e.g., unigram features from a document should be stored as a Vector instead of a bunch of Doubles in the Schema. We should encourage structure where meaningful; e.g., the output of a learning algorithm should be appended as a new column (new element in the Schema) by default, rather than being appended to a big Vector of features. (3) As in my comment for (2), a Dataset for a full pipeline should have structure where meaningful. However, I agree that most common ML algorithms expect flat Vectors of features. There needs to be an easy way to select relevant features and transform them to a Vector, LabeledPoint, etc. Having structured Datasets in the pipeline should be useful for selecting relevant features. To transform the selection, it will be important to provide helper methods for mushing the data into Vectors or other common formats. The big challenge in my mind is (2): Figuring out default behavior and perhaps column naming/selection conventions which make it easy to select subsets of features (or even have an implicit selection if possible). What do you think? 
> Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. > The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.01).registerTempTable("event") > val training = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > event.action AS label, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""").cache() > val indexer = new Indexer() > val interactor = new Interactor() > val fvAssembler = new FeatureVectorAssembler() > val treeClassifer = new DecisionTreeClassifer() > val paramMap = new ParamMap() > .put(indexer.features, Map("userCountryIndex" -> "userCountry")) > .put(indexer.sortByFrequency, true) > .put(interactor.features, Map("genderMatch" -> Array("userGender", > "targetGender"))) > .put(fvAssembler.features, Map("features" -> Array("genderMatch", > "userCountryIndex", "userFeatures"))) > .put(fvAssembler.dense, true) > .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes > "features" and "label" columns. 
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, > treeClassifier) > val model = pipeline.fit(training, paramMap) > sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event") > val test = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""") > val prediction = model.transform(test).select('eventId, 'prediction) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
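The helper methods the comments above ask for, selecting structured columns and assembling them into a flat feature vector, can be sketched in a few lines. Pure Python; the column names come from the sample pipeline, while the function itself is hypothetical:

```python
def assemble_features(row, columns):
    """Flatten selected columns of a structured row into one feature
    list: scalar values are appended directly, list-valued (vector)
    columns are spliced in element by element."""
    out = []
    for col in columns:
        value = row[col]
        if isinstance(value, (list, tuple)):
            out.extend(value)
        else:
            out.append(value)
    return out


row = {"userGender": 1.0, "userFeatures": [0.2, 0.5], "targetGender": 0.0}
vec = assemble_features(row, ["userGender", "userFeatures", "targetGender"])
# vec == [1.0, 0.2, 0.5, 0.0]
```

This is the kind of "mushing into Vectors" step a FeatureVectorAssembler-style component would perform between a structured Dataset and a flat-vector learning algorithm.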
[jira] [Created] (SPARK-4140) Document the dynamic allocation feature
Andrew Or created SPARK-4140: Summary: Document the dynamic allocation feature Key: SPARK-4140 URL: https://issues.apache.org/jira/browse/SPARK-4140 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or This blocks on SPARK-3795 and SPARK-3822. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4138) Guard against incompatible settings on the number of executors
[ https://issues.apache.org/jira/browse/SPARK-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188960#comment-14188960 ] Apache Spark commented on SPARK-4138: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3002 > Guard against incompatible settings on the number of executors > -- > > Key: SPARK-4138 > URL: https://issues.apache.org/jira/browse/SPARK-4138 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After SPARK-3822 and SPARK-3795, we now set a lower bound and an upper bound > for the number of executors. These settings are incompatible if the user sets > the number of executors explicitly, however. We need to add a guard against > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4139) Start the number of executors at the max if dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188961#comment-14188961 ] Apache Spark commented on SPARK-4139: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3002 > Start the number of executors at the max if dynamic allocation is enabled > - > > Key: SPARK-4139 > URL: https://issues.apache.org/jira/browse/SPARK-4139 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or > > SPARK-3795 allows us to dynamically scale the number of executors up and > down. We should start the number at the max instead of from 0 in the > beginning, because the first job will likely run immediately after the > SparkContext is set up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4139) Start the number of executors at the max if dynamic allocation is enabled
Andrew Or created SPARK-4139: Summary: Start the number of executors at the max if dynamic allocation is enabled Key: SPARK-4139 URL: https://issues.apache.org/jira/browse/SPARK-4139 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or SPARK-3795 allows us to dynamically scale the number of executors up and down. We should start the number at the max instead of from 0 in the beginning, because the first job will likely run immediately after the SparkContext is set up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4138) Guard against incompatible settings on the number of executors
Andrew Or created SPARK-4138: Summary: Guard against incompatible settings on the number of executors Key: SPARK-4138 URL: https://issues.apache.org/jira/browse/SPARK-4138 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or After SPARK-3822 and SPARK-3795, we now set a lower bound and an upper bound for the number of executors. These settings are incompatible if the user sets the number of executors explicitly, however. We need to add a guard against this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
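The guard can be sketched as a configuration validation step run before launching executors. The config key names below follow the dynamic-allocation conventions discussed in these JIRAs but are assumptions here, not a definitive list:

```python
def validate_executor_settings(conf):
    """Reject configurations that set an explicit executor count while
    dynamic allocation (with its min/max bounds) is enabled, and sanity
    check that the lower bound does not exceed the upper bound."""
    dynamic = conf.get("spark.dynamicAllocation.enabled", "false") == "true"
    if dynamic and "spark.executor.instances" in conf:
        raise ValueError(
            "spark.executor.instances is incompatible with dynamic allocation")
    lo = int(conf.get("spark.dynamicAllocation.minExecutors", 0))
    hi = int(conf.get("spark.dynamicAllocation.maxExecutors", lo))
    if lo > hi:
        raise ValueError("minExecutors (%d) > maxExecutors (%d)" % (lo, hi))
```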
[jira] [Commented] (SPARK-3796) Create shuffle service for external block storage
[ https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188947#comment-14188947 ] Apache Spark commented on SPARK-3796: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3001 > Create shuffle service for external block storage > - > > Key: SPARK-3796 > URL: https://issues.apache.org/jira/browse/SPARK-3796 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Aaron Davidson > > This task will be broken up into two parts -- the first being to refactor > our internal shuffle service to use a BlockTransferService which we can > easily extract out into its own service, and the second being to actually > do the extraction. > Here is the design document for the low-level service, nicknamed "Sluice", on > top of which will be Spark's BlockTransferService API: > https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188938#comment-14188938 ] Nicholas Chammas commented on SPARK-3398: - No problem. I've opened [SPARK-4137] to track this issue, and [PR 2988|https://github.com/apache/spark/pull/2988] to resolve it. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
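The blocking function described above, poll until every node reaches the desired state rather than sleeping a fixed --wait interval, can be sketched generically. All names here are illustrative, not spark_ec2.py's actual API:

```python
import time


def wait_for_state(get_states, desired, timeout_s=300.0, poll_s=10.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Block until every node reported by get_states() (a callable
    returning one state string per node) is in the desired state.
    Returns True on success, False once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        states = get_states()
        if states and all(s == desired for s in states):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The injectable sleep/clock arguments are there so callers (and tests) can drive the loop without real delays; a spark-ec2 port would call it as, e.g., wait_for_state(lambda: [i.state for i in cluster_instances], "running").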
[jira] [Created] (SPARK-4137) Relative paths don't get handled correctly by spark-ec2
Nicholas Chammas created SPARK-4137: --- Summary: Relative paths don't get handled correctly by spark-ec2 Key: SPARK-4137 URL: https://issues.apache.org/jira/browse/SPARK-4137 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188928#comment-14188928 ] Xiangrui Meng commented on SPARK-3080: -- SimpleALS is not merged yet. You need to build it and submit it as an application: http://spark.apache.org/docs/latest/submitting-applications.html > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919 ] Evan Sparks commented on SPARK-3573: This comment originally appeared on the PR associated with this feature (https://github.com/apache/spark/pull/2919): I've looked at the code here, and it basically seems reasonable. One high-level concern I have is around the programming pattern that this encourages: complex nesting of otherwise simple structure that may make it difficult to program against Datasets for sufficiently complicated applications. A 'dataset' is now a collection of Row, where we have the guarantee that all rows in a Dataset conform to the same schema. A schema is a list of (name, type) pairs which describe the attributes available in the dataset. This seems like a good thing to me, and is pretty much what we described in MLI (and how conventional databases have been structured forever). So far, so good. The concern that I have is that we are now encouraging these attributes to be complex types. For example, where I might have had val x = Schema(('a', classOf[String]), ('b', classOf[Double]), ..., ("z", classOf[Double])) this would become val x = Schema(('a', classOf[String]), ('bGroup', classOf[Vector]), ..., ("zGroup", classOf[Vector])) So, great, my schema now has these vector things in it, which I can create separately, pass around, etc. This clearly has its merits: 1) Features are grouped together logically based on the process that creates them. 2) Managing one short schema where each record is comprised of a few large objects (say, 4 vectors, each of length 1000) is probably easier than managing a really big schema comprised of lots of small objects (say, 4000 doubles). But there are some major drawbacks: 1) Why stop at only one level of nesting? Why not have Vector[Vector]? 2) How do learning algorithms, like SVM or PCA, deal with these Datasets? 
Is there an implicit conversion that flattens these things to RDD[LabeledPoint]? Do we want to guarantee these semantics? 3) Manipulating and subsetting nested schemas like this might be tricky. Where before I might be able to write: val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002)) now I might have to write val groupSelections = Seq(Seq(0,1,2,4),Seq(0,1),Seq(0,1,2)) val x: Dataset = groupSelections.zip(input.columns).map {case (gs, col) => col(gs) } Ignoring the raw syntax and semantics of how you might actually map an operation over the columns of a Dataset and get back a well-structured dataset, I think this makes two conflicting points: 1) In the first example - presumably all the work goes into figuring out which subset of features you want in this really wide feature space. 2) In the second example - there’s a lot of gymnastics that goes into subsetting feature groups. I think it’s clear that working with lots of feature groups might get unreasonable pretty quickly. If we look at R or pandas/scikit-learn as examples of projects that have (arguably quite successfully) dealt with these interface issues, there is one basic pattern: learning algorithms expect big tables of numbers as input. Even here, there are some important differences: for example, in scikit-learn, categorical features aren’t supported directly by most learning algorithms. Instead, users are responsible for getting data from “table with heterogeneously typed columns” to “table of numbers” with something like OneHotEncoder and other feature transformers. In R, on the other hand, categorical features are (sometimes frustratingly) first-class citizens by virtue of the “factor” data type - which is essentially an enum. Most out-of-the-box learning algorithms (like glm()) accept data frames with categorical inputs and handle them sensibly - implicitly one-hot encoding (or creating dummy variables, if you prefer) the categorical features. 
While I have a slight preference for representing things as big flat tables, I would be fine coding either way - but I wanted to raise the issue for discussion here before the interfaces are set in stone. > Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. > The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.
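To make the flat-versus-nested subsetting contrast above concrete, here is a rough plain-Python sketch of the two selection styles. All names here are illustrative; none of this is Spark's actual Dataset API:

```python
# Hypothetical illustration of the two selection styles discussed above;
# none of these names come from Spark's actual Dataset API.

# Flat schema: one wide row, selection is a single list of column indices.
flat_row = ["id"] + [float(i) for i in range(8)]   # 1 string + 8 doubles
flat_selection = [1, 2, 4]                         # pick features directly
flat_subset = [flat_row[i] for i in flat_selection]

# Nested schema: a few vector-valued columns, so selection needs
# per-group index lists ("gymnastics" in the comment above).
nested_row = {"id": "id",
              "aGroup": [0.0, 1.0, 2.0, 3.0],
              "bGroup": [4.0, 5.0, 6.0, 7.0]}
group_selections = {"aGroup": [1, 2], "bGroup": [0]}
nested_subset = [nested_row[g][i]
                 for g, idxs in group_selections.items()
                 for i in idxs]
```

The flat case keeps selection to one index list over a wide schema; the nested case trades a shorter schema for per-group bookkeeping, which is exactly the trade-off being debated.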
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188916#comment-14188916 ] Josh Rosen commented on SPARK-4105: --- It seems plausible that the bug tracked in SPARK-4107 could have caused this issue, but I'm waiting for confirmation that the fix there resolves it. > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > 
at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {code} >
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188910#comment-14188910 ] Josh Rosen commented on SPARK-3630: --- *Decompression errors during shuffle fetching*: If you've seen errors like {{FAILED_TO_UNCOMPRESS(5)}} during shuffle fetching, then please see SPARK-4105. We believe that this might be fixed by SPARK-4107, but we're awaiting confirmation from the folks that have been able to reproduce these errors. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... 
> {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty
[ https://issues.apache.org/jira/browse/SPARK-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4136: - Affects Version/s: (was: 1.1.0) 1.2.0 > Under dynamic allocation, cancel outstanding executor requests when pending > task queue is empty > --- > > Key: SPARK-4136 > URL: https://issues.apache.org/jira/browse/SPARK-4136 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty
Sandy Ryza created SPARK-4136: - Summary: Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty Key: SPARK-4136 URL: https://issues.apache.org/jira/browse/SPARK-4136 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4135) Error reading Parquet file generated with SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-4135: -- Attachment: _metadata part-r-1.parquet Files generated by SparkSQL that cannot be read. > Error reading Parquet file generated with SparkSQL > -- > > Key: SPARK-4135 > URL: https://issues.apache.org/jira/browse/SPARK-4135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Hossein Falaki > Attachments: _metadata, part-r-1.parquet > > > I read a tsv version of the one million songs dataset (available here: > http://tbmmsd.s3.amazonaws.com/) > After reading it I create a SchemaRDD with following schema: > {code} > root > |-- track_id: string (nullable = true) > |-- analysis_sample_rate: string (nullable = true) > |-- artist_7digitalid: string (nullable = true) > |-- artist_familiarity: double (nullable = true) > |-- artist_hotness: double (nullable = true) > |-- artist_id: string (nullable = true) > |-- artist_latitude: string (nullable = true) > |-- artist_location: string (nullable = true) > |-- artist_longitude: string (nullable = true) > |-- artist_mbid: string (nullable = true) > |-- artist_mbtags: array (nullable = true) > ||-- element: string (containsNull = true) > |-- artist_mbtags_count: array (nullable = true) > ||-- element: double (containsNull = true) > |-- artist_name: string (nullable = true) > |-- artist_playmeid: string (nullable = true) > |-- artist_terms: array (nullable = true) > ||-- element: string (containsNull = true) > |-- artist_terms_freq: array (nullable = true) > ||-- element: double (containsNull = true) > |-- artist_terms_weight: array (nullable = true) > ||-- element: double (containsNull = true) > |-- audio_md5: string (nullable = true) > |-- bars_confidence: array (nullable = true) > ||-- element: double (containsNull = true) > |-- bars_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- beats_confidence: array 
(nullable = true) > ||-- element: double (containsNull = true) > |-- beats_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- danceability: double (nullable = true) > |-- duration: double (nullable = true) > |-- end_of_fade_in: double (nullable = true) > |-- energy: double (nullable = true) > |-- key: string (nullable = true) > |-- key_confidence: double (nullable = true) > |-- loudness: double (nullable = true) > |-- mode: double (nullable = true) > |-- mode_confidence: double (nullable = true) > |-- release: string (nullable = true) > |-- release_7digitalid: string (nullable = true) > |-- sections_confidence: array (nullable = true) > ||-- element: double (containsNull = true) > |-- sections_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_confidence: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_loudness_max: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_loudness_max_time: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_loudness_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_pitches: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- segments_timbre: array (nullable = true) > ||-- element: double (containsNull = true) > |-- similar_artists: array (nullable = true) > ||-- element: string (containsNull = true) > |-- song_hotness: double (nullable = true) > |-- song_id: string (nullable = true) > |-- start_of_fade_out: double (nullable = true) > |-- tatums_confidence: array (nullable = true) > ||-- element: double (containsNull = true) > |-- tatums_start: array (nullable = true) > ||-- element: double (containsNull = true) > |-- tempo: double (nullable = true) > |-- time_signature: double (nullable = true) > |-- 
time_signature_confidence: double (nullable = true) > |-- title: string (nullable = true) > |-- track_7digitalid: string (nullable = true) > |-- year: double (nullable = true) > {code} > I select a single record from it and save it using saveAsParquetFile(). > When I read it later and try to query it I get the following exception: > {code} > Error in SQL statement: java.lang.RuntimeException: > java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188815#comment-14188815 ] Josh Rosen commented on SPARK-4133: --- Also, can you paste more of the log leading up to the error? It would be helpful to see any other log messages from broadcast, such as messages about it fetching pieces / blocks. > PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 > -- > > Key: SPARK-4133 > URL: https://issues.apache.org/jira/browse/SPARK-4133 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Antonio Jesus Navarro >Priority: Blocker > > Snappy related problems found when trying to upgrade existing Spark Streaming > App from 1.0.2 to 1.1.0. > We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 > > IOException is thrown by snappy (parsing_error(2)) > > Only spark version changed > As far as we have checked, snappy will throw this error when dealing with > zero bytes length arrays. > We have tried: > > Changing from snappy to LZF, > > Changing broadcast.compression false > > Changing from TorrentBroadcast to HTTPBroadcast. > but with no luck for the moment. 
> {code} > [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0] > org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage > 0.0 (TID 0) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232) > at > org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
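For reference, the three mitigations the reporter tried (switching Snappy to LZF, disabling broadcast compression, and switching TorrentBroadcast to HTTPBroadcast) correspond to Spark configuration properties that can be passed to spark-submit. This is a hedged sketch: property names follow the Spark 1.1 configuration documentation, and `com.example.StreamingApp` and `app.jar` are placeholders, not anything from this report:

```shell
# Hypothetical spark-submit invocation showing the three mitigations tried
# above (LZF codec, no broadcast compression, HTTP broadcast factory).
# Verify property names and values against your Spark version's docs;
# com.example.StreamingApp and app.jar are placeholders.
spark-submit \
  --class com.example.StreamingApp \
  --conf spark.io.compression.codec=lzf \
  --conf spark.broadcast.compress=false \
  --conf spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory \
  app.jar
```

As the reporter notes, none of these settings worked around the error, which is consistent with the root cause being in deserialization rather than in any one codec or broadcast mechanism.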
[jira] [Created] (SPARK-4135) Error reading Parquet file generated with SparkSQL
Hossein Falaki created SPARK-4135: - Summary: Error reading Parquet file generated with SparkSQL Key: SPARK-4135 URL: https://issues.apache.org/jira/browse/SPARK-4135 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Hossein Falaki I read a tsv version of the one million songs dataset (available here: http://tbmmsd.s3.amazonaws.com/) After reading it I create a SchemaRDD with following schema: {code} root |-- track_id: string (nullable = true) |-- analysis_sample_rate: string (nullable = true) |-- artist_7digitalid: string (nullable = true) |-- artist_familiarity: double (nullable = true) |-- artist_hotness: double (nullable = true) |-- artist_id: string (nullable = true) |-- artist_latitude: string (nullable = true) |-- artist_location: string (nullable = true) |-- artist_longitude: string (nullable = true) |-- artist_mbid: string (nullable = true) |-- artist_mbtags: array (nullable = true) ||-- element: string (containsNull = true) |-- artist_mbtags_count: array (nullable = true) ||-- element: double (containsNull = true) |-- artist_name: string (nullable = true) |-- artist_playmeid: string (nullable = true) |-- artist_terms: array (nullable = true) ||-- element: string (containsNull = true) |-- artist_terms_freq: array (nullable = true) ||-- element: double (containsNull = true) |-- artist_terms_weight: array (nullable = true) ||-- element: double (containsNull = true) |-- audio_md5: string (nullable = true) |-- bars_confidence: array (nullable = true) ||-- element: double (containsNull = true) |-- bars_start: array (nullable = true) ||-- element: double (containsNull = true) |-- beats_confidence: array (nullable = true) ||-- element: double (containsNull = true) |-- beats_start: array (nullable = true) ||-- element: double (containsNull = true) |-- danceability: double (nullable = true) |-- duration: double (nullable = true) |-- end_of_fade_in: double (nullable = true) |-- energy: double (nullable = true) |-- key: string (nullable 
= true) |-- key_confidence: double (nullable = true) |-- loudness: double (nullable = true) |-- mode: double (nullable = true) |-- mode_confidence: double (nullable = true) |-- release: string (nullable = true) |-- release_7digitalid: string (nullable = true) |-- sections_confidence: array (nullable = true) ||-- element: double (containsNull = true) |-- sections_start: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_confidence: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_loudness_max: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_loudness_max_time: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_loudness_start: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_pitches: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_start: array (nullable = true) ||-- element: double (containsNull = true) |-- segments_timbre: array (nullable = true) ||-- element: double (containsNull = true) |-- similar_artists: array (nullable = true) ||-- element: string (containsNull = true) |-- song_hotness: double (nullable = true) |-- song_id: string (nullable = true) |-- start_of_fade_out: double (nullable = true) |-- tatums_confidence: array (nullable = true) ||-- element: double (containsNull = true) |-- tatums_start: array (nullable = true) ||-- element: double (containsNull = true) |-- tempo: double (nullable = true) |-- time_signature: double (nullable = true) |-- time_signature_confidence: double (nullable = true) |-- title: string (nullable = true) |-- track_7digitalid: string (nullable = true) |-- year: double (nullable = true) {code} I select a single record from it and save it using saveAsParquetFile(). 
When I read it later and try to query it I get the following exception: {code} Error in SQL statement: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getSplits$1.apply(ParquetTableOperations.scala:472) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getSplits$1.apply(ParquetTableOperations.scala:457) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.sca
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Target Version/s: 1.2.0 Affects Version/s: 1.1.0 > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4081) Categorical feature indexing
[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188810#comment-14188810 ] Apache Spark commented on SPARK-4081: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/3000 > Categorical feature indexing > > > Key: SPARK-4081 > URL: https://issues.apache.org/jira/browse/SPARK-4081 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > DecisionTree and RandomForest require that categorical features and labels be > indexed 0,1,2 There is currently no code to aid with indexing a dataset. > This is a proposal for a helper class for computing indices (and also > deciding which features to treat as categorical). > Proposed functionality: > * This helps process a dataset of unknown vectors into a dataset with some > continuous features and some categorical features. The choice between > continuous and categorical is based upon a maxCategories parameter. > * This can also map categorical feature values to 0-based indices. > Usage: > {code} > val myData1: RDD[Vector] = ... > val myData2: RDD[Vector] = ... > val datasetIndexer = new DatasetIndexer(maxCategories) > datasetIndexer.fit(myData1) > val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1) > datasetIndexer.fit(myData2) > val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2) > val categoricalFeaturesInfo: Map[Double, Int] = > datasetIndexer.getCategoricalFeatureIndexes() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4003) Add {Big Decimal, Timestamp, Date} types to Java SqlContext
[ https://issues.apache.org/jira/browse/SPARK-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4003. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2850 [https://github.com/apache/spark/pull/2850] > Add {Big Decimal, Timestamp, Date} types to Java SqlContext > --- > > Key: SPARK-4003 > URL: https://issues.apache.org/jira/browse/SPARK-4003 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > Fix For: 1.2.0 > > > in JavaSqlContext, we need to let java program use big decimal, timestamp, > date types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4081) Categorical feature indexing
[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4081: - Description: DecisionTree and RandomForest require that categorical features and labels be indexed 0,1,2 There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. Usage: {code} val myData1: RDD[Vector] = ... val myData2: RDD[Vector] = ... val datasetIndexer = new DatasetIndexer(maxCategories) datasetIndexer.fit(myData1) val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1) datasetIndexer.fit(myData2) val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2) val categoricalFeaturesInfo: Map[Double, Int] = datasetIndexer.getCategoricalFeatureIndexes() {code} was: DecisionTree and RandomForest require that categorical features and labels be indexed 0,1,2 There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. Usage: {code} val myData1: RDD[Vector] = ... val myData2: RDD[Vector] = ... 
val datasetIndexer = new DatasetIndexer(maxCategories) datasetIndexer.fit(myData1) val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1) datasetIndexer.fit(myData2) val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2) val categoricalFeaturesInfo: Map[Int, Int] = datasetIndexer.getCategoricalFeaturesInfo() {code} > Categorical feature indexing > > > Key: SPARK-4081 > URL: https://issues.apache.org/jira/browse/SPARK-4081 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > DecisionTree and RandomForest require that categorical features and labels be > indexed 0,1,2 There is currently no code to aid with indexing a dataset. > This is a proposal for a helper class for computing indices (and also > deciding which features to treat as categorical). > Proposed functionality: > * This helps process a dataset of unknown vectors into a dataset with some > continuous features and some categorical features. The choice between > continuous and categorical is based upon a maxCategories parameter. > * This can also map categorical feature values to 0-based indices. > Usage: > {code} > val myData1: RDD[Vector] = ... > val myData2: RDD[Vector] = ... > val datasetIndexer = new DatasetIndexer(maxCategories) > datasetIndexer.fit(myData1) > val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1) > datasetIndexer.fit(myData2) > val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2) > val categoricalFeaturesInfo: Map[Double, Int] = > datasetIndexer.getCategoricalFeatureIndexes() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3958) Possible stream-corruption issues in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3958: -- Affects Version/s: 1.1.0 Adding 1.1.0 as an affected version, since a user has observed this in 1.1.0, too; see SPARK-4133. > Possible stream-corruption issues in TorrentBroadcast > - > > Key: SPARK-3958 > URL: https://issues.apache.org/jira/browse/SPARK-3958 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > TorrentBroadcast deserialization sometimes fails with decompression errors, > which are most likely caused by stream-corruption exceptions. For example, > this can manifest itself as a Snappy PARSING_ERROR when deserializing a > broadcasted task: > {code} > 14/10/14 17:20:55.016 DEBUG BlockManager: Getting local block broadcast_8 > 14/10/14 17:20:55.016 DEBUG BlockManager: Block broadcast_8 not registered > locally > 14/10/14 17:20:55.016 INFO TorrentBroadcast: Started reading broadcast > variable 8 > 14/10/14 17:20:55.017 INFO TorrentBroadcast: Reading broadcast variable 8 > took 5.3433E-5 s > 14/10/14 17:20:55.017 ERROR Executor: Exception in task 2.0 in stage 8.0 (TID > 18) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216) > at > 
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:170) > at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > SPARK-3630 is an umbrella ticket for investigating all causes of these Kryo > and Snappy deserialization errors. This ticket is for a more > narrowly-focused exploration of the TorrentBroadcast version of these errors, > since the similar errors that we've seen in sort-based shuffle seem to be > explained by a different cause (see SPARK-3948). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188665#comment-14188665 ] Josh Rosen commented on SPARK-4133: --- Since you mentioned that you see a similar issue when using HTTPBroadcast, could you post the stacktrace from that case, too? Similarly, can you post the stacktrace when broadcast compression is disabled? > PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 > -- > > Key: SPARK-4133 > URL: https://issues.apache.org/jira/browse/SPARK-4133 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Antonio Jesus Navarro >Priority: Blocker > > Snappy related problems found when trying to upgrade existing Spark Streaming > App from 1.0.2 to 1.1.0. > We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 > > IOException is thrown by snappy (parsing_error(2)) > > Only spark version changed > As far as we have checked, snappy will throw this error when dealing with > zero bytes length arrays. > We have tried: > > Changing from snappy to LZF, > > Changing broadcast.compression false > > Changing from TorrentBroadcast to HTTPBroadcast. > but with no luck for the moment. 
> {code} > [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0] > org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage > 0.0 (TID 0) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232) > at > org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4129: - Assignee: DB Tsai > Performance tuning in MultivariateOnlineSummarizer > -- > > Key: SPARK-4129 > URL: https://issues.apache.org/jira/browse/SPARK-4129 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.2.0 > > > In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop > through the nonZero elements in the vector. However, activeIterator doesn't > perform well due to lots of overhead. In this PR, native while loop is used > for both DenseVector and SparseVector. > The benchmark result with 20 executors using mnist8m dataset: > Before: > DenseVector: 48.2 seconds > SparseVector: 16.3 seconds > After: > DenseVector: 17.8 seconds > SparseVector: 11.2 seconds > Since MultivariateOnlineSummarizer is used in several places, the overall > performance gain in mllib library will be significant with this PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4129. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2992 [https://github.com/apache/spark/pull/2992] > Performance tuning in MultivariateOnlineSummarizer > -- > > Key: SPARK-4129 > URL: https://issues.apache.org/jira/browse/SPARK-4129 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.2.0 > > > In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop > through the nonZero elements in the vector. However, activeIterator doesn't > perform well due to lots of overhead. In this PR, native while loop is used > for both DenseVector and SparseVector. > The benchmark result with 20 executors using mnist8m dataset: > Before: > DenseVector: 48.2 seconds > SparseVector: 16.3 seconds > After: > DenseVector: 17.8 seconds > SparseVector: 11.2 seconds > Since MultivariateOnlineSummarizer is used in several places, the overall > performance gain in mllib library will be significant with this PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
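The optimization described in SPARK-4129 replaces breeze's activeIterator with a direct while loop over the vector's underlying arrays. Below is a hedged plain-Python sketch of that access pattern for the sparse case; the field names are invented for illustration, and the real MultivariateOnlineSummarizer tracks more statistics than this:

```python
def add_sparse(summary, indices, values):
    # Direct index loop over the sparse (indices, values) arrays: the
    # access pattern the PR substitutes for breeze's activeIterator,
    # which pays per-element iterator overhead.
    k = 0
    while k < len(indices):
        i, v = indices[k], values[k]
        summary["sum"][i] += v
        summary["nnz"][i] += 1
        summary["max"][i] = max(summary["max"][i], v)
        k += 1
    summary["count"] += 1

# Running statistics for a 4-dimensional dataset; max starts at 0.0
# because absent sparse entries are implicit zeros.
summary = {"sum": [0.0] * 4, "nnz": [0] * 4,
           "max": [0.0] * 4, "count": 0}
add_sparse(summary, [0, 2], [1.5, -2.0])
add_sparse(summary, [2, 3], [4.0, 0.5])
```

Only the nonzero slots are touched per sample, which is what makes the SparseVector path cheap.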
[jira] [Commented] (SPARK-3182) Twitter Streaming Geolocation Filter
[ https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188544#comment-14188544 ] Brennon York commented on SPARK-3182: - Hey all, looking to contribute back to Spark :) Would like to take this as a first issue. Could you please assign to me? Thanks! > Twitter Streaming Geolocation Filter > > > Key: SPARK-3182 > URL: https://issues.apache.org/jira/browse/SPARK-3182 > Project: Spark > Issue Type: Wish > Components: Streaming >Affects Versions: 1.0.0, 1.0.2 >Reporter: Daniel Kershaw > Labels: features > Fix For: 1.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add a geolocation filter to the Twitter Streaming Component. > This should take a sequence of doubles to indicate the bounding box for the > stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
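For reference, the requested filter reduces to a point-in-box predicate over the stream. A minimal sketch, assuming the southwest-then-northeast flat coordinate order that twitter4j's FilterQuery.locations uses (an assumption of this sketch, not confirmed by the ticket):

```python
def in_bounding_box(lon, lat, box):
    # box = (sw_lon, sw_lat, ne_lon, ne_lat): southwest corner first,
    # then northeast corner.
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

# Illustrative box roughly covering the continental United States.
US_BOX = (-125.0, 24.0, -66.0, 50.0)
```

In the streaming component this predicate would be applied to each status's geolocation before it is emitted into the DStream.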
[jira] [Created] (SPARK-4134) Tone down scary executor lost messages when killing on purpose
Andrew Or created SPARK-4134: Summary: Tone down scary executor lost messages when killing on purpose Key: SPARK-4134 URL: https://issues.apache.org/jira/browse/SPARK-4134 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or After SPARK-3822 goes in, we are now able to dynamically kill executors after an application has started. However, when we do that we get a ton of scary error messages telling us that we've done something wrong. It would be good to detect when this is the case and prevent these messages from surfacing. This may be difficult, however, because the connection manager tends to be quite verbose in unconditionally logging disconnection messages. This is a very nice-to-have for 1.2 but certainly not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188357#comment-14188357 ] Michael Griffiths edited comment on SPARK-3398 at 10/29/14 1:58 PM: Hi Nicholas, Thanks for the thorough investigation! Making the path absolute does work for me, when called with spark-ec2. Thanks! was (Author: michael.griffiths): Hi Nicholas, Thanks for the thorough investigation! Making the path absolute does work for me, when called with spark-ec2. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188357#comment-14188357 ] Michael Griffiths commented on SPARK-3398: -- Hi Nicholas, Thanks for the thorough investigation! Making the path absolute does work for me, when called with spark-ec2. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188354#comment-14188354 ] Ilya Ganelin commented on SPARK-3080: - Hello Xiangrui - happy to hear that you're on this! With regards to the first question, I have not seen any spillage to disk but I have seen executor loss (on a relatively frequent basis). I have not known whether this is a function of use on our cluster or an internal spark issue. With regards to upgrading ALS, can I simply replace the old SimpleALS.scala with the new one or will there be additional dependencies? I am interested in doing a piece-meal upgrade of ML Lib (without upgrading the rest of Spark from version 1.1). I want to do this to maintain compatibility with CDH 5.2. Please let me know, thank you. > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188325#comment-14188325 ] RJ Nowling commented on SPARK-2429: --- The sparsity tests look good. Have you compared training and assignment time to KMeans yet? An improvement in the assignment time will be important. Also, I don't see a breakdown of the total time by splitting clusters, assignments, etc. Doesn't need to be for every combination of parameters just one or two. That would be very helpful. Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-2429: --- Attachment: benchmark-result.2014-10-29.html I added a new performance test results named `benchmark-result.2014-10-29.html`. The main change from the previous result is that I added the benchmark result about vector sparsity. Please check it. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
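The "top down, recursive application of KMeans" approach listed in the ticket can be sketched with a toy bisecting step: split the data with 2-means, then recurse into each half. This is a plain-Python illustration with deterministic seeding and Euclidean distance only, not the benchmarked implementation:

```python
def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _mean(cluster):
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def two_means(points, iters=10):
    # Deterministic seeding for the sketch: first and last point.
    c0, c1 = points[0], points[-1]
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if _dist2(p, c0) <= _dist2(p, c1)]
        right = [p for p in points if _dist2(p, c0) > _dist2(p, c1)]
        if not left or not right:
            break
        c0, c1 = _mean(left), _mean(right)
    return left, right

def divisive_cluster(points, min_size=2, max_depth=3, depth=0):
    # Top-down recursion: bisect with 2-means, recurse on each half.
    if depth >= max_depth or len(points) < 2 * min_size:
        return points                      # leaf: a flat list of points
    left, right = two_means(points)
    if not left or not right:
        return points
    return [divisive_cluster(left, min_size, max_depth, depth + 1),
            divisive_cluster(right, min_size, max_depth, depth + 1)]

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
tree = divisive_cluster(pts, min_size=2, max_depth=1)
```

The recursion also makes the cost breakdown requested above natural to measure: splitting time is the repeated two_means calls, and assignment time is a walk down the resulting tree.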
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188284#comment-14188284 ] Antonio Jesus Navarro commented on SPARK-4133: -- Existing Spark Streaming app can not be upgraded from 1.0.2 to 1.1.0 > PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 > -- > > Key: SPARK-4133 > URL: https://issues.apache.org/jira/browse/SPARK-4133 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Antonio Jesus Navarro >Priority: Blocker > > Snappy related problems found when trying to upgrade existing Spark Streaming > App from 1.0.2 to 1.1.0. > We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 > > IOException is thrown by snappy (parsing_error(2)) > > Only spark version changed > As far as we have checked, snappy will throw this error when dealing with > zero bytes length arrays. > We have tried: > > Changing from snappy to LZF, > > Changing broadcast.compression false > > Changing from TorrentBroadcast to HTTPBroadcast. > but with no luck for the moment. 
> {code} > [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0] > org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage > 0.0 (TID 0) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232) > at > org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
Antonio Jesus Navarro created SPARK-4133: Summary: PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 Key: SPARK-4133 URL: https://issues.apache.org/jira/browse/SPARK-4133 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Antonio Jesus Navarro Priority: Blocker Snappy related problems found when trying to upgrade existing Spark Streaming App from 1.0.2 to 1.1.0. We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 > IOException is thrown by snappy (parsing_error(2)) > Only spark version changed As far as we have checked, snappy will throw this error when dealing with zero bytes length arrays. We have tried: > Changing from snappy to LZF, > Changing broadcast.compression false > Changing from TorrentBroadcast to HTTPBroadcast. but with no luck for the moment. {code} [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0] org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 0.0 (TID 0) java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232) at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
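The three workarounds the reporter tried can be sketched as Spark configuration settings. This is illustrative only, not a confirmed fix: the keys below are standard Spark properties (spark.io.compression.codec, spark.broadcast.compress, spark.broadcast.factory), and in practice they would be passed to SparkConf.set(...) or spark-defaults.conf.

```python
# Sketch (assumption: Spark 1.1-era property names) of the workarounds
# described in the report, expressed as plain key/value pairs.
def snappy_workaround_conf():
    return {
        # Switch the shuffle/broadcast codec from Snappy to LZF.
        "spark.io.compression.codec": "org.apache.spark.io.LZFCompressionCodec",
        # Disable broadcast compression entirely.
        "spark.broadcast.compress": "false",
        # Fall back from TorrentBroadcast to HttpBroadcast.
        "spark.broadcast.factory": "org.apache.spark.broadcast.HttpBroadcastFactory",
    }
```

Per the report, none of these avoided the PARSING_ERROR(2), which points at the zero-length-array handling rather than the codec choice.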
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188227#comment-14188227 ] Tamas Jambor commented on SPARK-3683: - OK, makes sense. Thanks. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamas Jambor closed SPARK-3683. --- Resolution: Not a Problem > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188194#comment-14188194 ] Cheng Lian commented on SPARK-3683: --- [~jamborta] Your concern is legitimate. However, unfortunately we have to take Hive compatibility into consideration in this case, otherwise people who run legacy Hive scripts with Spark SQL may get wrong query result. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188185#comment-14188185 ] Tamas Jambor commented on SPARK-3683: - Thanks for the comments. From my perspective this is a matter of inconsistency: all the other types are represented as None in Python, except string. So I run another pass on the data and convert all the NULL values to None. The problem with the literal string "NULL" is that you cannot build logic to handle it in subsequent steps, since it is not represented in the appropriate way (missing values are usually handled as a special case). > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
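The extra pass Tamas describes can be sketched in plain Python. The function name and row shape are illustrative assumptions; in practice the rows would come from the Spark SQL query result (e.g. rows.map(null_to_none)).

```python
def null_to_none(row, sentinel="NULL"):
    """Replace the literal string sentinel with None in a row of values.

    Illustrative post-processing pass: only string fields carrying the
    literal 'NULL' are affected, matching the behaviour in the report.
    """
    return [None if v == sentinel else v for v in row]
```

The caveat from the discussion still applies: once the value arrives as the string 'NULL', it is indistinguishable from a legitimate string with that content, which is why downstream missing-value logic breaks.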
[jira] [Commented] (SPARK-683) Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation
[ https://issues.apache.org/jira/browse/SPARK-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188160#comment-14188160 ] Sean Owen commented on SPARK-683: - PS I think this also turns out to be the same as SPARK-4078 > Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation > > > Key: SPARK-683 > URL: https://issues.apache.org/jira/browse/SPARK-683 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 0.7.0 >Reporter: Tathagata Das > > A simple saveAsObjectFile() leads to the following error. > org.apache.hadoop.ipc.RemoteException: java.io.IOException: > java.lang.NoSuchMethodException: > org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, > org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, > boolean, short, long) > at java.lang.Class.getMethod(Class.java:1622) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:416) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4132) Spark uses incompatible HDFS API
[ https://issues.apache.org/jira/browse/SPARK-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4132. -- Resolution: Duplicate I'm all but certain you're describing the same thing as SPARK-4078 > Spark uses incompatible HDFS API > > > Key: SPARK-4132 > URL: https://issues.apache.org/jira/browse/SPARK-4132 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Spark1.1.0 on Hadoop1.2.1 > CentOS 6.3 64bit >Reporter: kuromatsu nobuyuki >Priority: Minor > > When I enable event logging and set it to output to HDFS, initialization > fails with 'java.lang.ClassNotFoundException' (see trace below). > I found that an API incompatibility in > org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop > 1.1.0 (and above) causes this error > (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't > exist in my 1.2.1 environment). > I think that the Spark jar file pre-built for Hadoop1.X should be built on > Hadoop Stable version(Hadoop 1.2.1). > 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server > listener on 9000: > readAndProcess threw exception java.lang.RuntimeException: > readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. 
> Count of bytes read: 0 > java.lang.RuntimeException: readObject can't find class > org.apache.hadoop.fs.permission.FsPermission$2 > at > org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233) > at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106) > at > org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347) > at > org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326) > at > org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226) > at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577) > at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:701) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.fs.permission.FsPermission$2 > at java.net.URLClassLoader$1.run(URLClassLoader.java:217) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:205) > at java.lang.ClassLoader.loadClass(ClassLoader.java:323) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) > at java.lang.ClassLoader.loadClass(ClassLoader.java:268) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) > at > org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231) > ... 9 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Description: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; was: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . > eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select > * from page_views; > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188158#comment-14188158 ] Apache Spark commented on SPARK-4131: - User 'wangxiaojing' has created a pull request for this issue: https://github.com/apache/spark/pull/2997 > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . > eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Description: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF was: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . 
> eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4132) Spark uses incompatible HDFS API
kuromatsu nobuyuki created SPARK-4132: - Summary: Spark uses incompatible HDFS API Key: SPARK-4132 URL: https://issues.apache.org/jira/browse/SPARK-4132 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Spark 1.1.0 on Hadoop 1.2.1 CentOS 6.3 64bit Reporter: kuromatsu nobuyuki Priority: Minor When I enable event logging and set it to output to HDFS, initialization fails with 'java.lang.ClassNotFoundException' (see trace below). I found that an API incompatibility in org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop 1.1.0 (and above) causes this error (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't exist in my 1.2.1 environment). I think the Spark jar pre-built for Hadoop 1.x should be built against the stable Hadoop version (Hadoop 1.2.1). 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: readAndProcess threw exception java.lang.RuntimeException: readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. 
Count of bytes read: 0 java.lang.RuntimeException: readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2 at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233) at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106) at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.permission.FsPermission$2 at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231) ... 9 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Description: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF was: Writing data into the filesystem from queries,SparkSql is not support . eg: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . 
> eg: > > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188147#comment-14188147 ] Ravindra Pesala commented on SPARK-4131: I will work on this issue. > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . > eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
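Until INSERT OVERWRITE LOCAL DIRECTORY is supported, one manual workaround is to collect the query result on the driver and write it out locally. This is a hedged sketch, not Spark SQL's eventual implementation: the function name, the row tuples, and the choice of Hive's default Ctrl-A (\x01) field delimiter are all assumptions for illustration.

```python
import os

def write_rows_local(rows, out_dir, filename="part-00000", sep="\x01"):
    """Write collected query rows to a local directory, roughly what
    'insert overwrite LOCAL DIRECTORY ...' would produce.

    Hive's default field delimiter is Ctrl-A (\x01); rows would typically
    come from something like sqlContext.sql(query).collect().
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w") as f:
        for row in rows:
            f.write(sep.join(str(v) for v in row) + "\n")
    return path
```

Collecting to the driver only works for results that fit in driver memory, which is exactly why native support for this statement is worth having.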
[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Summary: Support "Writing data into the filesystem from queries" (was: support “Writing data into the filesystem from queries”) > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . > eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) support “Writing data into the filesystem from queries”
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Summary: support “Writing data into the filesystem from queries” (was: support “insert overwrite LOCAL DIRECTORY ‘dir’ select * from tablename;”) > support “Writing data into the filesystem from queries” > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Critical > Fix For: 1.3.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Writing data into the filesystem from queries,SparkSql is not support . > eg: > insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > out: > > java.lang.RuntimeException: > Unsupported language features in query: insert overwrite LOCAL DIRECTORY > '/data1/wangxj/sql_spark' select * from page_views > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > page_views > TOK_INSERT > TOK_DESTINATION > TOK_LOCAL_DIR > '/data1/wangxj/sql_spark' > TOK_SELECT > TOK_SELEXPR > TOK_ALLCOLREF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4131) support “insert overwrite LOCAL DIRECTORY ‘dir’ select * from tablename;”
XiaoJing wang created SPARK-4131: Summary: support “insert overwrite LOCAL DIRECTORY ‘dir’ select * from tablename;” Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Critical Fix For: 1.3.0 Spark SQL does not support writing data into the filesystem from queries. E.g.: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; out: java.lang.RuntimeException: Unsupported language features in query: insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME page_views TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR '/data1/wangxj/sql_spark' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442: -- Attachment: Window Function.pdf > Add Window function support > --- > > Key: SPARK-1442 > URL: https://issues.apache.org/jira/browse/SPARK-1442 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Chengxiang Li > Attachments: Window Function.pdf > > > Similar to Hive, add window function support for Catalyst. > https://issues.apache.org/jira/browse/HIVE-4197 > https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442: -- Attachment: (was: Window Function.pdf) > Add Window function support > --- > > Key: SPARK-1442 > URL: https://issues.apache.org/jira/browse/SPARK-1442 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Chengxiang Li > Attachments: Window Function.pdf > > > Similar to Hive, add window function support for Catalyst. > https://issues.apache.org/jira/browse/HIVE-4197 > https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4130) loadLibSVMFile does not handle extra whitespace
[ https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188110#comment-14188110 ] Apache Spark commented on SPARK-4130: - User 'jegonzal' has created a pull request for this issue: https://github.com/apache/spark/pull/2996 > loadLibSVMFile does not handle extra whitespace > --- > > Key: SPARK-4130 > URL: https://issues.apache.org/jira/browse/SPARK-4130 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Joseph E. Gonzalez > > When testing MLlib on the splice site data > (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), > loadLibSVMFile fails. To reproduce in the Spark shell: > {code:scala} > import org.apache.spark.mllib.util.MLUtils > val data = MLUtils.loadLibSVMFile(sc, > "hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t") > {code} > generates the error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > 0.0:73 failed 4 times, most recent failure: Exception failure in TID 335 on > host ip-172-31-31-54.us-west-2.compute.internal: > java.lang.NumberFormatException: For input string: "" > > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > java.lang.Integer.parseInt(Integer.java:504) > java.lang.Integer.parseInt(Integer.java:527) > > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) > scala.collection.immutable.StringOps.toInt(StringOps.scala:31) > > org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81) > > org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$class.foreach(Iterator.scala:727) > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
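The NumberFormatException on the empty input string "" is what splitting a line on single spaces produces when that line contains extra whitespace: runs of spaces yield empty tokens, and parsing one of them as an integer fails. A minimal Python sketch of the failure mode and the whitespace-tolerant alternative (a hypothetical parser for illustration, not MLlib's actual implementation):

```python
def parse_libsvm_line(line):
    # str.split() with no separator argument collapses runs of whitespace
    # and drops leading/trailing space, so no empty tokens are produced.
    # Splitting on a single space (line.split(" ")) would yield "" tokens
    # for doubled spaces, and int("") raises ValueError -- the Python
    # analogue of the NumberFormatException in the report.
    tokens = line.strip().split()
    label = float(tokens[0])
    indices, values = [], []
    for item in tokens[1:]:
        idx, val = item.split(":")
        indices.append(int(idx))
        values.append(float(val))
    return label, indices, values

# A line with extra internal and trailing whitespace parses cleanly:
parse_libsvm_line("1  3:0.5   7:1.0 ")  # → (1.0, [3, 7], [0.5, 1.0])
```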
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188105#comment-14188105 ] Davies Liu commented on SPARK-3683: --- [~jamborta] It seems that this is a feature, not a bug. Does this work for you? > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, which does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
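Per the report, only string columns come back as the literal string 'NULL' while other types convert correctly. Until that behavior changes, a caller could normalize rows after the query. A hedged workaround sketch (the helper name and tuple-based rows are illustrative, not part of the PySpark API):

```python
def null_str_to_none(row):
    # Replace the literal string 'NULL' (as Hive string columns reportedly
    # come back in PySpark) with Python None; all other values pass through.
    return tuple(None if v == "NULL" else v for v in row)

# Only the string sentinel is rewritten; real data and non-strings survive:
null_str_to_none(("alice", "NULL", 42))  # → ("alice", None, 42)
```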
[jira] [Commented] (SPARK-4124) Simplify serialization and call API in MLlib Python
[ https://issues.apache.org/jira/browse/SPARK-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188104#comment-14188104 ] Apache Spark commented on SPARK-4124: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2995 > Simplify serialization and call API in MLlib Python > --- > > Key: SPARK-4124 > URL: https://issues.apache.org/jira/browse/SPARK-4124 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > > There is much repeated code doing similar things: converting an RDD into a Java > object, converting arguments into Java, and converting a result RDD/object back into > Python. These could be simplified to share the same code.
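The deduplication idea described above can be sketched as one generic call helper that centralizes the convert-arguments / call-JVM / convert-result steps every wrapper currently repeats. Everything below is a hypothetical illustration of the pattern; the converter and function names are stand-ins, not the actual MLlib internals:

```python
# Placeholder converters standing in for the real Python<->JVM serialization.
def to_java(obj):
    return ("java", obj)

def from_java(obj):
    return obj[1]

def call_mllib_func(func, *args):
    # One shared path: convert each argument, dispatch, convert the result.
    java_args = [to_java(a) for a in args]
    result = func(*java_args)
    return from_java(result)

# A mock JVM-side trainer; every wrapper now reuses the same conversion code
# instead of re-implementing it per method.
def mock_train(java_data):
    return ("java", {"weights": [0.1, 0.2]})

call_mllib_func(mock_train, [1, 2, 3])  # → {"weights": [0.1, 0.2]}
```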
[jira] [Created] (SPARK-4130) loadLibSVMFile does not handle extra whitespace
Joseph E. Gonzalez created SPARK-4130: - Summary: loadLibSVMFile does not handle extra whitespace Key: SPARK-4130 URL: https://issues.apache.org/jira/browse/SPARK-4130 Project: Spark Issue Type: Bug Components: MLlib Reporter: Joseph E. Gonzalez When testing MLlib on the splice site data (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), loadLibSVMFile failed. To reproduce in the Spark shell: import org.apache.spark.mllib.util.MLUtils val data = MLUtils.loadLibSVMFile(sc, "hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t") generates the error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 failed 4 times, most recent failure: Exception failure in TID 335 on host ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: For input string: "" java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) java.lang.Integer.parseInt(Integer.java:504) java.lang.Integer.parseInt(Integer.java:527) scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) scala.collection.immutable.StringOps.toInt(StringOps.scala:31) org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81) org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org