[jira] [Created] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
Nitin Goyal created SPARK-4849: -- Summary: Pass partitioning information (distribute by) to In-memory caching Key: SPARK-4849 URL: https://issues.apache.org/jira/browse/SPARK-4849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Nitin Goyal Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitin Goyal updated SPARK-4849: --- Description: HQL's "distribute by column_name" partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements; e.g. in joins, an extra partitioning step can be saved based on this information. Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html Pass partitioning information (distribute by) to In-memory caching -- Key: SPARK-4849 URL: https://issues.apache.org/jira/browse/SPARK-4849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Nitin Goyal Priority: Minor HQL's "distribute by column_name" partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements; e.g. in joins, an extra partitioning step can be saved based on this information. Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
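[Editorial note] A minimal sketch of the scenario this improvement targets, using the Spark 1.2-era HiveContext APIs; the table and column names (logs, users, user_id) are hypothetical:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// DISTRIBUTE BY partitions rows by user_id before the result is cached.
val byUser = hiveContext.sql("SELECT * FROM logs DISTRIBUTE BY user_id")
byUser.registerTempTable("logs_by_user")
hiveContext.cacheTable("logs_by_user")
// Today a join on user_id repartitions the cached data again; with this
// improvement, the known partitioning could let the planner skip that step.
val joined = hiveContext.sql(
  "SELECT * FROM logs_by_user l JOIN users u ON l.user_id = u.user_id")
{code}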
[jira] [Created] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
Chaozhong Yang created SPARK-4850: - Summary: GROUP BY can't work if the schema of SchemaRDD contains struct or array type Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang

In the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
```

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

The following exception is thrown:

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
	at $iwC$$iwC$$iwC.<init>(<console>:22)
	at $iwC$$iwC.<init>(<console>:24)
	at $iwC.<init>(<console>:26)
	at <init>(<console>:28)
	at .<init>(<console>:32)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at
```
[jira] [Updated] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaozhong Yang updated SPARK-4850: -- Description: Code in the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
```

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

The following exception is thrown:

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
	at $iwC$$iwC$$iwC.<init>(<console>:22)
	at $iwC$$iwC.<init>(<console>:24)
	at $iwC.<init>(<console>:26)
	at <init>(<console>:28)
	at .<init>(<console>:32)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
	at
```
[jira] [Updated] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-4740: --- Attachment: repartition test.7z

Hi [~rxin], [~adav], I ran several tests on HDDs and ramdisk with *repartition(192)* to compare shuffle performance for NIO and Netty, with the same dataset as before (400GB). I uploaded the archived file *repartition test.7z*, which contains the results of 6 tests:
1. NIO on ramdisk
2. NIO on HDDs
3. Netty on ramdisk with connectionPerPeer set to 1
4. Netty on ramdisk with connectionPerPeer set to 8
5. Netty on HDDs with connectionPerPeer set to 1
6. Netty on HDDs with connectionPerPeer set to 8
P.S. In the attached HTML files, the unit of IO throughput is requests, not bytes.

From the 6 tests, it is very obvious that reduce performance improves a lot when *connectionPerPeer* is raised from 1 to 8, both on ramdisk and on HDDs. For HDDs, the reduce time of Netty with *connectionPerPeer=8* is about the same as NIO's (about 6.7 mins). For ramdisk, Netty outperforms NIO even with *connectionPerPeer=1*. That is because NIO is memory-bandwidth bound, which I have confirmed with other tools; that's why the CPU utilization of NIO in the reduce phase is only about 50%, while Netty can still gain performance by increasing *connectionPerPeer*. This is expected because NIO needs extra memory copies compared with Netty. Before these 6 tests, I had monitored IO with *iostat* for the HDD case. With *connectionPerPeer* at its default (=1), Netty's read request queue size, read requests, await, and %util are all smaller than NIO's, which means Netty's reads are not well parallelized. So far we can confirm that Netty does not get good read concurrency on a small cluster with many disks (if *connectionPerPeer* is not set), but we still cannot conclude that Netty can run faster than NIO on HDDs.

Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip, repartition test.7z, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z When testing the current Spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one; the network throughput of Netty is only about half of NIO's. We tested in standalone mode, and the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB memory per executor. The number of reduce tasks is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246655#comment-14246655 ] Sean Owen commented on SPARK-4845: -- I'm interested in the motivation for this. Mapping down to fewer partitions will increase the amount of shuffling, right? When is this preferable to simply repartitioning directly? Adding a parallelismRatio to control the partitions num of shuffledRDD -- Key: SPARK-4845 URL: https://issues.apache.org/jira/browse/SPARK-4845 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.3.0 Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is: Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD). The ratio is 1.0 by default, to stay compatible with the old behavior; once we have good experience with it, we can change this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
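[Editorial note] A minimal sketch of the proposed rule (the name parallelismRatio comes from the ticket; the function shape is illustrative, not the actual patch):
{code}
import org.apache.spark.rdd.RDD

// Partition count for a shuffled RDD derived from its upstream RDDs:
// max(1, parallelismRatio * partitions of the largest parent).
def shufflePartitions(parents: Seq[RDD[_]], parallelismRatio: Double = 1.0): Int = {
  val largest = parents.map(_.partitions.length).max
  math.max(1, (parallelismRatio * largest).toInt)
}
{code}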
[jira] [Commented] (SPARK-4844) SGD should support custom sampling.
[ https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246657#comment-14246657 ] Sean Owen commented on SPARK-4844: -- Hm, in what case would you want to not sample the minibatch uniformly at random in SGD? SGD should support custom sampling. --- Key: SPARK-4844 URL: https://issues.apache.org/jira/browse/SPARK-4844 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246665#comment-14246665 ] Sean Owen commented on SPARK-4846: -- I think you're just running out of memory on your driver. It does not have enough memory to copy and serialize two data structures, syn0Global and syn1Global, each of which contains (vocab size * vector length) floats. With a default vector length of 100 and a 10M vocabulary, that's at least 8GB of RAM, and the default for the driver isn't nearly that big. I think this is just a matter of increasing your driver memory; I imagine you will need 16GB+. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Priority: Critical

Exception in thread "Driver" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:2271)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
	at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
	at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
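[Editorial note] A back-of-the-envelope check of the estimate above; the sizes come from the reporter's environment, and the factor of two covers syn0Global and syn1Global:
{code}
val vocabSize  = 10000000L  // ~10M words
val vectorSize = 100L       // Word2Vec's default vector length
// Two Float arrays of vocabSize * vectorSize elements, 4 bytes each:
val bytes = 2L * vocabSize * vectorSize * 4L
println(f"${bytes / math.pow(1024, 3)}%.2f GB")  // ~7.45 GB, before serialization overhead
{code}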
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246732#comment-14246732 ] Sean Owen commented on SPARK-4846: -- But being lazy doesn't really change whether it is serialized, right? One way or the other, the recipients of the higher-order function have to get the same data. The function does use the data structures; it's not a question of simply keeping something out of the closure that shouldn't be there. Is the problem that only part of this large data structure should go to each partition? When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Priority: Critical

Exception in thread "Driver" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:2271)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
	at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
	at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
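[Editorial note] For context, a common remedy for closures that capture huge arrays is to ship them as broadcast variables so each executor fetches them once instead of receiving a serialized copy in every task closure. This is only a general sketch (the array here is a tiny stand-in), not necessarily the fix ultimately applied to Word2Vec:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))
val syn0Global = new Array[Float](1000)   // stand-in for the real, huge array
val syn0Bc = sc.broadcast(syn0Global)     // distributed out-of-band, not inside the closure
val values = sc.parallelize(0 until 1000).mapPartitions { iter =>
  val syn0 = syn0Bc.value                 // resolved on the executor
  iter.map(i => syn0(i))
}
{code}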
[jira] [Commented] (SPARK-4844) SGD should support custom sampling.
[ https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246738#comment-14246738 ] Guoqiang Li commented on SPARK-4844: The main reason is that {{RDD.sample}} is not efficient: it loads all the data into memory. See https://github.com/witgo/spark/compare/SPARK-4844 SGD should support custom sampling. --- Key: SPARK-4844 URL: https://issues.apache.org/jira/browse/SPARK-4844 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4844) SGD should support custom sampling.
[ https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246749#comment-14246749 ] Sean Owen commented on SPARK-4844: -- No, it definitely does not. See {{PartitionwiseSampledRDD}}, and how it uses {{BernoulliSampler}} and {{PoissonSampler}}, which are already pluggable if you want. They use gap-sampling iterators. If that's the only change, I would close this. The PR reinvents some of the classes above. SGD should support custom sampling. --- Key: SPARK-4844 URL: https://issues.apache.org/jira/browse/SPARK-4844 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
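[Editorial note] To make the mechanism Sean describes concrete, here is a rough sketch of what {{RDD.sample}} does internally via {{PartitionwiseSampledRDD}}, written against the public {{DeveloperApi}} sampler classes (since {{PartitionwiseSampledRDD}} itself is private to Spark): each partition streams through a seeded sampler iterator, so the data is never materialized in memory.
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.util.random.BernoulliSampler

// Approximation of rdd.sample(withReplacement = false, fraction): a
// per-partition, seeded, iterator-based sampler -- no full materialization.
def sampleLike[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long): RDD[T] =
  rdd.mapPartitionsWithIndex({ (idx, iter) =>
    val sampler = new BernoulliSampler[T](fraction)
    sampler.setSeed(seed + idx)  // independent random stream per partition
    sampler.sample(iter)
  }, preservesPartitioning = true)
{code}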
[jira] [Commented] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246794#comment-14246794 ] Apache Spark commented on SPARK-4547: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3702 OOM when making bins in BinaryClassificationMetrics --- Key: SPARK-4547 URL: https://issues.apache.org/jira/browse/SPARK-4547 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Also following up on http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E -- this one I intend to make a PR for a bit later. The conversation was basically: {quote} Recently I was using BinaryClassificationMetrics to build an AUC curve for a classifier over a reasonably large number of points (~12M). The scores were all probabilities, so tended to be almost entirely unique. The computation does some operations by key, and this ran out of memory. It's something you can solve with more than the default amount of memory, but in this case, it seemed unuseful to create an AUC curve with such fine-grained resolution. I ended up just binning the scores so there were ~1000 unique values and then it was fine. {quote} and: {quote} Yes, if there are many distinct values, we need binning to compute the AUC curve. Usually, the scores are not evenly distribution, we cannot simply truncate the digits. Estimating the quantiles for binning is necessary, similar to RangePartitioner: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104 Limiting the number of bins is definitely useful. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
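[Editorial note] A sketch of the interim workaround described in the quoted thread (binning scores to ~1000 distinct values before building the curve). Here {{scoreAndLabels}} is an assumed {{RDD[(Double, Double)]}}, and simple truncation is used even though the thread notes quantile-based binning is better for skewed score distributions:
{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Round each score to 3 decimal places so at most ~1001 distinct
// thresholds remain, keeping the by-key aggregation small.
val binned = scoreAndLabels.map { case (score, label) =>
  (math.rint(score * 1000) / 1000.0, label)
}
val metrics = new BinaryClassificationMetrics(binned)
println(metrics.areaUnderROC())
{code}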
[jira] [Updated] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers
[ https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4852: -- Priority: Minor (was: Major) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers -- Key: SPARK-4852 URL: https://issues.apache.org/jira/browse/SPARK-4852 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Minor When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR [2685|https://github.com/apache/spark/pull/2685], Kryo 2.22, used by the original hive-exec-0.13.1.jar, was shaded by Kryo 2.21, used by Spark SQL, because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may cause Hive query plan deserialization failures; this bug was fixed in Kryo 2.22. Normally this issue doesn't affect Spark SQL because we don't even generate Hive query plans. But when running Hive test suites like {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, which triggers this issue. A workaround is to replace {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under {{$SPARK_DEV_HOME/lib_managed/jars}}, then add {{$HIVE_HOME/lib}} to {{$HADOOP_CLASSPATH}}. Upgrading to a newer version of Kryo that is binary compatible with Kryo 2.22 (if there is one) may fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers
[ https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246799#comment-14246799 ] Cheng Lian commented on SPARK-4852: --- Lowered priority to Minor since this issue only affects Spark SQL developers. Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers -- Key: SPARK-4852 URL: https://issues.apache.org/jira/browse/SPARK-4852 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Minor When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR [2685|https://github.com/apache/spark/pull/2685], Kryo 2.22, used by the original hive-exec-0.13.1.jar, was shaded by Kryo 2.21, used by Spark SQL, because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may cause Hive query plan deserialization failures; this bug was fixed in Kryo 2.22. Normally this issue doesn't affect Spark SQL because we don't even generate Hive query plans. But when running Hive test suites like {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, which triggers this issue. A workaround is to replace {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under {{$SPARK_DEV_HOME/lib_managed/jars}}, then add {{$HIVE_HOME/lib}} to {{$HADOOP_CLASSPATH}}. Upgrading to a newer version of Kryo that is binary compatible with Kryo 2.22 (if there is one) may fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246831#comment-14246831 ] Jing Dong commented on SPARK-3619: -- [~tnachen] Will this be released with Spark 1.2.0? I also noticed the documentation on Spark saying Mesos compatibility is 0.18.1. Is this up-to-date? Upgrade to Mesos 0.21 to work around MESOS-1688 --- Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia When Mesos 0.21 comes out, it will have a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246910#comment-14246910 ] Apache Spark commented on SPARK-1442: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3703 Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Attachments: Window Function.pdf Similar to Hive, add window function support for Catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4740. Resolution: Fixed Fix Version/s: 1.2.0 Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Fix For: 1.2.0 Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip, repartition test.7z, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z When testing the current Spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one; the network throughput of Netty is only about half of NIO's. We tested in standalone mode, and the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB memory per executor. The number of reduce tasks is set to 1000. --- Reynold's update on Dec 15, 2014: The problem is that in NIO we have multiple connections between two nodes, but in Netty we only had one. We introduced a new config option, spark.shuffle.io.numConnectionsPerPeer, to allow users to explicitly increase the number of connections between two nodes. SPARK-4853 is a follow-up ticket to investigate having Spark set this automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
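[Editorial note] For reference, a sketch of how a user would set the new option; the property name is taken from Reynold's note above, and the value 8 matches the setting that closed the gap in the tests attached to this ticket:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.numConnectionsPerPeer", "8")  // per Reynold's note, Netty previously used one
{code}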
[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246996#comment-14246996 ] Patrick Wendell commented on SPARK-4837: Hey [~aash], because there is a workaround (you can simply switch back to the old IO mode), we probably won't block on it, but I can include it in the release notes as a known issue. We can also spin a bug-fix release to address this in a week or two. It is indeed an annoying issue and will be bad for usability if someone upgrades. NettyBlockTransferService does not abide by spark.blockManager.port config option - Key: SPARK-4837 URL: https://issues.apache.org/jira/browse/SPARK-4837 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4837: --- Target Version/s: 1.2.1 NettyBlockTransferService does not abide by spark.blockManager.port config option - Key: SPARK-4837 URL: https://issues.apache.org/jira/browse/SPARK-4837 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247030#comment-14247030 ] Andrew Ash commented on SPARK-4837: --- OK, that's fair -- a release note and targeting 1.2.1 sounds good. Draft language for that release note could be: - Spark 1.2.0 changes the default block transfer service to NettyBlockTransferService, a higher-performance block transfer service than the old XYZBlockTransferService. The new transfer service does not yet respect `spark.blockManager.port`, so deployments needing full control of Spark's network ports in 1.2.0 should temporarily set `spark.abc=xyz` and watch SPARK-4837 NettyBlockTransferService does not abide by spark.blockManager.port config option - Key: SPARK-4837 URL: https://issues.apache.org/jira/browse/SPARK-4837 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
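[Editorial note] The draft deliberately leaves the old service name and the property as placeholders. For what it's worth, a sketch of the fallback Patrick describes, assuming the 1.2-era property spark.shuffle.blockTransferService (an assumption worth verifying against the 1.2.0 configuration docs before the note is finalized):
{code}
import org.apache.spark.SparkConf

// Assumed property names: fall back to the NIO-based transfer service,
// which does honor spark.blockManager.port; 31000 is an example port.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "nio")
  .set("spark.blockManager.port", "31000")
{code}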
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247043#comment-14247043 ] Patrick Wendell commented on SPARK-4826: I pushed a hotfix disabling these tests, but let's re-enable them once things are working. Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support! Key: SPARK-4826 URL: https://issues.apache.org/jira/browse/SPARK-4826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Tathagata Das Labels: flaky-test I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite where four tests failed with the same exception. [Link to test result (this will eventually break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. In case that link breaks: The failed tests:
{code}
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in block manager, not in write ahead log
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, not in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, and test storing in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with partially available in block manager, and rest in write ahead log
{code}
The error messages are all (essentially) the same:
{code}
java.lang.IllegalStateException: File exists and there is no append support!
	at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.<init>(WriteAheadLogWriter.scala:42)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
	at
{code}
[jira] [Commented] (SPARK-4810) Failed to run collect
[ https://issues.apache.org/jira/browse/SPARK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247081#comment-14247081 ] Patrick Wendell commented on SPARK-4810: Actually, can I suggest we move this to the Spark users list? We use this JIRA primarily for tracking identified bugs. For information on how to join the user list, see this page: http://spark.apache.org/community.html Failed to run collect - Key: SPARK-4810 URL: https://issues.apache.org/jira/browse/SPARK-4810 Project: Spark Issue Type: Question Environment: Spark 1.1.1 prebuilt for hadoop 2.4.0 Reporter: newjunwei My application failed as shown below; I want to know the possible reason. Could insufficient memory cause this? Environment: Spark 1.1.1 prebuilt for Hadoop 2.4.0, standalone deploy mode. There is no problem when running with a local master for testing, or when processing a smaller data set. My real data is definitely large, about 200 million key-value pairs; the smaller data set is about one tenth of that. I get my result via collect, and the result is very large too. I now suspect this problem is caused by the many failed tasks when collecting a large result. Is that right?

2014-12-09 21:51:47,830 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:71) - Lost task 60.1 in stage 1.1 (TID 566, server-21): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4
	org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
	org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
	sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	java.lang.reflect.Method.invoke(Method.java:597)
	java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
	java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
	java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
	java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
	java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
	java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
	java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
	java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
	java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
	org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
	org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
	java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	java.lang.Thread.run(Thread.java:662)
2014-12-09 21:51:49,460 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Starting task 60.2 in stage 1.1 (TID 603, server-11, PROCESS_LOCAL, 1295 bytes)
2014-12-09 21:51:49,461 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Lost task 9.3 in stage 1.1 (TID 579) on executor server-11: java.io.IOException (org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4) [duplicate 1]
2014-12-09 21:51:49,487 ERROR org.apache.spark.Logging$class.logError(Logging.scala:75) - Task 9 in stage 1.1 failed 4 times; aborting job
2014-12-09 21:51:49,494 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Cancelling stage 1
2014-12-09 21:51:49,498 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Stage 1 was cancelled
2014-12-09 21:51:49,511 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Failed to run collect at StatVideoService.scala:62
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247082#comment-14247082 ] Apache Spark commented on SPARK-4826: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3704 Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support! Key: SPARK-4826 URL: https://issues.apache.org/jira/browse/SPARK-4826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Tathagata Das Labels: flaky-test I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite where four tests failed with the same exception. [Link to test result (this will eventually break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. In case that link breaks: The failed tests:
{code}
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in block manager, not in write ahead log
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, not in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, and test storing in block manager
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with partially available in block manager, and rest in write ahead log
{code}
The error messages are all (essentially) the same:
{code}
java.lang.IllegalStateException: File exists and there is no append support!
	at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
	at org.apache.spark.streaming.util.WriteAheadLogWriter.<init>(WriteAheadLogWriter.scala:42)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
	at
{code}
[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip
[ https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4841: - Assignee: Davies Liu Batch serializer bug in PySpark's RDD.zip - Key: SPARK-4841 URL: https://issues.apache.org/jira/browse/SPARK-4841 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Davies Liu
{code}
t = sc.textFile("README.md")
t.zip(t).count()
{code}
{code}
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-6-60fdeb8339fd> in <module>()
----> 1 readme.zip(readme).count()

/Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
    817         3
    818         """
--> 819         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    820
    821     def stats(self):

/Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
    808         6.0
    809         """
--> 810         return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    811
    812     def count(self):

/Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
    713             yield reduce(f, iterator, initial)
    714
--> 715         vals = self.mapPartitions(func).collect()
    716         if vals:
    717             return reduce(f, vals)

/Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
    674
    675         with SCCallSiteSync(self.context) as css:
--> 676             bytesInJava = self._jrdd.collect().iterator()
    677         return list(self._collect_iterator_through_file(bytesInJava))
    678

/Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o69.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/meng/src/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/Users/meng/src/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/meng/src/spark/python/pyspark/serializers.py", line 198, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/meng/src/spark/python/pyspark/serializers.py", line 81, in dump_stream
    raise NotImplementedError
NotImplementedError

	at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
	at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
	at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at
{code}
[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS
[ https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247148#comment-14247148 ] sam commented on SPARK-2121: I seem to be getting this problem too. I have a job that should be caching a reasonably small data set (using _AND_DISK just in case, but it really is a small data set). Unfortunately my executors get killed, and around half of the job is re-run to recover the lost data. It's a very expensive job: although the data set is small, it iterates over a large broadcast variable many, many times, so I really don't want it recomputed (the difference between a 6-hour job and a 9-hour job). Is there currently any workaround, like increasing some of the many configurables? Not fully cached when there is enough memory in ALS --- Key: SPARK-2121 URL: https://issues.apache.org/jira/browse/SPARK-2121 Project: Spark Issue Type: Bug Components: Block Manager, MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Shuo Xiang While factorizing a large matrix using the latest Alternating Least Squares (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the partitions of some RDDs even though memory is sufficient. Please find [this post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html) for screenshots. This may cause subsequent job failures while executing `userOut.count()` or `productsOut.count()`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS
[ https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247167#comment-14247167 ] sam commented on SPARK-2121: I've tried increasing spark.yarn.executor.memoryOverhead, but I get: 14/12/15 20:11:20 WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory. I've checked the UI, and it doesn't really help me determine whether I have sufficient memory. I'm thinking the only workaround is to write out the results to disk, since the caching functionality is not behaving. Not fully cached when there is enough memory in ALS --- Key: SPARK-2121 URL: https://issues.apache.org/jira/browse/SPARK-2121 Project: Spark Issue Type: Bug Components: Block Manager, MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Shuo Xiang While factorizing a large matrix using the latest Alternating Least Squares (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the partitions of some RDDs even though memory is sufficient. Please find [this post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html) for screenshots. This may cause subsequent job failures while executing `userOut.count()` or `productsOut.count()`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
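For reference, a minimal spark-shell-style sketch of the two workarounds discussed in this thread, assuming the Spark 1.x RDD API; the paths and the computation are illustrative stand-ins, not anything from the reporter's job:
{code}
import org.apache.spark.storage.StorageLevel

// Stand-in for the expensive, iterated-many-times computation.
val expensive = sc.textFile("hdfs:///input").map(_.length)

// Option 1: spill to disk as well as memory, so partitions evicted under
// memory pressure are re-read from local disk instead of recomputed.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

// Option 2 (heavier): materialize to stable storage and read it back,
// which cuts the recomputation lineage entirely.
expensive.saveAsObjectFile("hdfs:///tmp/expensive-result")
val reloaded = sc.objectFile[Int]("hdfs:///tmp/expensive-result")
{code}
Note that MEMORY_AND_DISK only helps while the executor itself survives; if the whole executor is killed, as described in this thread, its local blocks are lost and only the write-out approach avoids recomputation.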
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247184#comment-14247184 ] Michael Armbrust commented on SPARK-4849: - The trick here will be to make sure that the outputPartitioning is correctly reported by the InMemoryColumnarTableScan. Pass partitioning information (distribute by) to In-memory caching -- Key: SPARK-4849 URL: https://issues.apache.org/jira/browse/SPARK-4849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Nitin Goyal Priority: Minor HQL's distribute by column_name partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements; e.g., in joins, an extra partitioning step can be saved based on this information. Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
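To make the proposal concrete, a hedged sketch of the intended usage (this is the scenario being optimized, not the implementation; the orders/customers tables and column names are illustrative, and a HiveContext over an existing SparkContext `sc` is assumed):
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Partition the data by the join key before caching.
hiveContext.sql("SELECT * FROM orders DISTRIBUTE BY customer_id")
  .registerTempTable("orders_by_customer")
hiveContext.cacheTable("orders_by_customer")

// Today the join below still repartitions the cached side; if the
// InMemoryColumnarTableScan reported its outputPartitioning, the planner
// could elide that extra exchange on customer_id.
hiveContext.sql(
  """SELECT o.customer_id, c.name
    |FROM orders_by_customer o JOIN customers c
    |ON o.customer_id = c.customer_id""".stripMargin)
{code}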
[jira] [Commented] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications
[ https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247235#comment-14247235 ] Chip Senkbeil commented on SPARK-4605: -- [~rdhyee], the short answer is no. The Spark Kernel is a pure Scala kernel that can be connected to from IPython and supports Scala. PySpark is a way to connect to the Spark cluster using a Python environment. Proposed Contribution: Spark Kernel to enable interactive Spark applications Key: SPARK-4605 URL: https://issues.apache.org/jira/browse/SPARK-4605 Project: Spark Issue Type: New Feature Reporter: Chip Senkbeil Attachments: Kernel Architecture Widescreen.pdf, Kernel Architecture.pdf Project available on Github: https://github.com/ibm-et/spark-kernel This architecture describes the running kernel code that was demonstrated at StrataConf in Barcelona, Spain. It enables applications to interact with a Spark cluster using Scala in several ways:
* Defining and running core Spark tasks
* Collecting results from a cluster without needing to write to an external data store
** Ability to stream results using a well-defined protocol
* Arbitrary Scala code definition and execution (without submitting heavyweight jars)
Applications can be hosted and managed separately from the Spark cluster, using the kernel as a proxy to communicate requests. The Spark Kernel implements the server side of the IPython Kernel protocol, the rising "de facto" protocol for language execution (Python, Haskell, etc.). This inherits a suite of industry-adopted clients such as the IPython Notebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247269#comment-14247269 ] Xiangrui Meng commented on SPARK-4510: -- The N^2 factor was what I was worried about. MLlib is supposed to deal with distributed datasets, which usually means very large N. Given the N^2 factor, the k-medoids implementation won't scale to even one million examples. Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h PAM (k-medoids) is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which works as follows:
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance).
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
The new feature for MLlib will contain 5 new files:
/main/scala/org/apache/spark/mllib/clustering/PAM.scala
/main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
/main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
/test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
/main/scala/org/apache/spark/examples/mllib/KMedoids.scala
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
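For illustration, a minimal single-machine sketch of the swap loop described above (plain Scala, not the proposed MLlib files). It makes the cost structure explicit: each pass evaluates k*(n-k) candidate swaps, and each cost evaluation touches all n points, which is exactly the N^2 factor raised in the comment above:
{code}
import scala.util.Random

type Point = Vector[Double]

def dist(a: Point, b: Point): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Total cost: each point contributes its distance to the closest medoid.
def cost(points: Seq[Point], medoids: Seq[Point]): Double =
  points.map(p => medoids.map(dist(p, _)).min).sum

def pam(points: Seq[Point], k: Int, maxPasses: Int = 20): Seq[Point] = {
  require(k > 0 && k < points.size)
  var medoids: Seq[Point] = Random.shuffle(points.toList).take(k)
  var improved = true
  var pass = 0
  while (improved && pass < maxPasses) {
    pass += 1
    val currentCost = cost(points, medoids)
    // Evaluate every (medoid, non-medoid) swap: k * (n - k) candidates,
    // each costing O(n * k) distance computations.
    val candidates = for {
      m <- medoids
      o <- points if !medoids.contains(o)
    } yield (m, o, cost(points, medoids.filterNot(_ == m) :+ o))
    val (m, o, bestCost) = candidates.minBy(_._3)
    if (bestCost < currentCost) medoids = medoids.filterNot(_ == m) :+ o
    else improved = false
  }
  medoids
}
{code}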
[jira] [Resolved] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4494. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3603 [https://github.com/apache/spark/pull/3603] IDFModel.transform() add support for single vector -- Key: SPARK-4494 URL: https://issues.apache.org/jira/browse/SPARK-4494 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.1, 1.2.0 Reporter: Jean-Philippe Quemener Priority: Minor Fix For: 1.3.0 For now, when using the tf-idf implementation of MLlib, you have no way to map your data back onto e.g. labels or ids other than a hackish zip-based approach: {quote} 1. Persist input RDD. 2. Transform it to just vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and new vector to LabeledPoint.{quote} Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] I think, as in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow using a single vector with IDFModel.transform(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
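The zip workaround the reporter quotes looks roughly like the following sketch (MLlib 1.1/1.2 APIs; `data` is assumed to be an RDD[LabeledPoint] whose features hold raw term frequencies):
{code}
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def tfToTfIdf(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  data.persist()                              // 1. persist the input RDD
  val tf: RDD[Vector] = data.map(_.features)  // 2. reduce to just the vectors
  val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf) // and apply the IDFModel
  data.zip(tfidf).map { case (lp, v) =>       // 3. zip with the original RDD
    LabeledPoint(lp.label, v)                 // 4. rebuild label + new vector
  }
}
{code}
With the change resolved here, the zip becomes unnecessary: transforming one vector at a time inside a single map keeps each label and its new vector together.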
[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4494: - Assignee: Yu Ishikawa IDFModel.transform() add support for single vector -- Key: SPARK-4494 URL: https://issues.apache.org/jira/browse/SPARK-4494 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.1, 1.2.0 Reporter: Jean-Philippe Quemener Assignee: Yu Ishikawa Priority: Minor Fix For: 1.3.0 For now, when using the tf-idf implementation of MLlib, you have no way to map your data back onto e.g. labels or ids other than a hackish zip-based approach: {quote} 1. Persist input RDD. 2. Transform it to just vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and new vector to LabeledPoint.{quote} Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] I think, as in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow using a single vector with IDFModel.transform(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip
[ https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247376#comment-14247376 ] Apache Spark commented on SPARK-4841: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3706 Batch serializer bug in PySpark's RDD.zip - Key: SPARK-4841 URL: https://issues.apache.org/jira/browse/SPARK-4841 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Davies Liu {code} t = sc.textFile(README.md) t.zip(t).count() {code} {code} Py4JJavaError Traceback (most recent call last) ipython-input-6-60fdeb8339fd in module() 1 readme.zip(readme).count() /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self) 817 3 818 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 820 821 def stats(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self) 808 6.0 809 -- 810 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) 811 812 def count(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f) 713 yield reduce(f, iterator, initial) 714 -- 715 vals = self.mapPartitions(func).collect() 716 if vals: 717 return reduce(f, vals) /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self) 674 675 with SCCallSiteSync(self.context) as css: -- 676 bytesInJava = self._jrdd.collect().iterator() 677 return list(self._collect_iterator_through_file(bytesInJava)) 678 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, -- 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o69.collect. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main process() File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in dump_stream raise NotImplementedError NotImplementedError at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
[jira] [Updated] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
[ https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1037: -- Description: The name of these two functions is confusing: though the comments say that the method does dequeue tasks from the list, the name does not explicitly indicate that the method will mutate the parameter in
{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
  while (!list.isEmpty) {
    val index = list.last
    list.trimEnd(1)
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    }
  }
  None
}
{code}
was: the name of these two functions is confusing though in the comments the author said that the method does dequeue tasks from the list but from the name, it is not explicitly indicating that the method will mutate the parameter in private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = { while (!list.isEmpty) { val index = list.last list.trimEnd(1) if (copiesRunning(index) == 0 && !successful(index)) { return Some(index) } } None } the name of findTaskFromList & findTask in TaskSetManager.scala is confusing Key: SPARK-1037 URL: https://issues.apache.org/jira/browse/SPARK-1037 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Nan Zhu Priority: Minor Labels: starter The name of these two functions is confusing: though the comments say that the method does dequeue tasks from the list, the name does not explicitly indicate that the method will mutate the parameter in
{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
  while (!list.isEmpty) {
    val index = list.last
    list.trimEnd(1)
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    }
  }
  None
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
[ https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1037. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3665 [https://github.com/apache/spark/pull/3665] the name of findTaskFromList & findTask in TaskSetManager.scala is confusing Key: SPARK-1037 URL: https://issues.apache.org/jira/browse/SPARK-1037 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Nan Zhu Priority: Minor Labels: starter Fix For: 1.3.0 The name of these two functions is confusing: though the comments say that the method does dequeue tasks from the list, the name does not explicitly indicate that the method will mutate the parameter in
{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
  while (!list.isEmpty) {
    val index = list.last
    list.trimEnd(1)
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    }
  }
  None
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
[ https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1037: -- Assignee: Ilya Ganelin (was: Josh Rosen) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing Key: SPARK-1037 URL: https://issues.apache.org/jira/browse/SPARK-1037 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Nan Zhu Assignee: Ilya Ganelin Priority: Minor Labels: starter Fix For: 1.3.0 The name of these two functions is confusing: though the comments say that the method does dequeue tasks from the list, the name does not explicitly indicate that the method will mutate the parameter in
{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
  while (!list.isEmpty) {
    val index = list.last
    list.trimEnd(1)
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    }
  }
  None
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
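Since the complaint is purely about naming, here is a sketch of one shape such a rename could take, using the "dequeue" verb the issue description itself suggests. This is a standalone illustration, not the merged patch: the TaskSetManager state is passed in explicitly here so the snippet compiles on its own.
{code}
import scala.collection.mutable.ArrayBuffer

object TaskQueueSketch {
  // "dequeue" makes it explicit that `list` shrinks as a side effect.
  def dequeueTaskFromList(
      list: ArrayBuffer[Int],
      copiesRunning: Int => Int,
      successful: Int => Boolean): Option[Int] = {
    while (list.nonEmpty) {
      val index = list.last
      list.trimEnd(1) // mutates the caller's buffer
      if (copiesRunning(index) == 0 && !successful(index)) {
        return Some(index)
      }
    }
    None
  }
}
{code}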
[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip
[ https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4841: - Priority: Blocker (was: Major) Batch serializer bug in PySpark's RDD.zip - Key: SPARK-4841 URL: https://issues.apache.org/jira/browse/SPARK-4841 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Blocker {code} t = sc.textFile(README.md) t.zip(t).count() {code} {code} Py4JJavaError Traceback (most recent call last) ipython-input-6-60fdeb8339fd in module() 1 readme.zip(readme).count() /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self) 817 3 818 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 820 821 def stats(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self) 808 6.0 809 -- 810 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) 811 812 def count(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f) 713 yield reduce(f, iterator, initial) 714 -- 715 vals = self.mapPartitions(func).collect() 716 if vals: 717 return reduce(f, vals) /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self) 674 675 with SCCallSiteSync(self.context) as css: -- 676 bytesInJava = self._jrdd.collect().iterator() 677 return list(self._collect_iterator_through_file(bytesInJava)) 678 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, -- 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o69.collect. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main process() File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in dump_stream raise NotImplementedError NotImplementedError at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[jira] [Commented] (SPARK-1216) Add a OneHotEncoder for handling categorical features
[ https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247414#comment-14247414 ] Joseph K. Bradley commented on SPARK-1216: -- (Addressing old comments I just saw now...) [~jaggi] I'd recommend keeping this separate from [https://issues.apache.org/jira/browse/SPARK-1303] since it is a different kind of transformation. [~srowen] This is a bit different from [SPARK-4081]; see the comment in my PR: [https://github.com/apache/spark/pull/3000#issuecomment-62630207]. I'd recommend keeping them separate. I don't immediately see a good way to organize these, but it could be worth discussing. Add a OneHotEncoder for handling categorical features - Key: SPARK-1216 URL: https://issues.apache.org/jira/browse/SPARK-1216 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Ryza It would be nice to add something to MLLib to make it easy to do one-of-K encoding of categorical features. Something like: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
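For readers unfamiliar with the transformation, a minimal sketch of one-of-K encoding over an RDD of category indices (illustrative only; this is not the API being proposed here, and `numCategories` is assumed known, e.g. from a prior pass over the data):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Encode a 0-based category index i as a sparse vector with a single 1.0.
def oneHot(indices: RDD[Int], numCategories: Int): RDD[Vector] =
  indices.map { i =>
    require(0 <= i && i < numCategories, s"category index $i out of range")
    Vectors.sparse(numCategories, Array(i), Array(1.0))
  }
{code}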
[jira] [Commented] (SPARK-4501) Create build/mvn to automatically download maven/zinc/scalac
[ https://issues.apache.org/jira/browse/SPARK-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247448#comment-14247448 ] Apache Spark commented on SPARK-4501: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/3707 Create build/mvn to automatically download maven/zinc/scalac Key: SPARK-4501 URL: https://issues.apache.org/jira/browse/SPARK-4501 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma For a long time we've had the sbt/sbt script, and this works well for users who want to build Spark with minimal dependencies (only Java). It would be nice to generalize this to Maven as well and have build/sbt and build/mvn, where build/mvn is a script that downloads Maven, Zinc, and Scala locally and sets them up correctly. This would be totally opt-in, and people using system Maven would be able to continue doing so. My sense is that very few Maven users are currently using Zinc, even though from some basic tests I saw a huge improvement from using it. Also, having a simple way to use Zinc would make it easier to use Maven on our Jenkins test machines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247473#comment-14247473 ] Josh Rosen commented on SPARK-785: -- I've merged https://github.com/apache/spark/pull/3690 to fix this in the maintenance branches and have tagged this for a 1.2.1 backport. ClosureCleaner not invoked on most PairRDDFunctions --- Key: SPARK-785 URL: https://issues.apache.org/jira/browse/SPARK-785 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Sean Owen Labels: backport-needed Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0 It's pretty weird that we've missed this so far, but it seems to be the case. Unfortunately it may not be good to fix this in 0.7.3 because it could change behavior in unexpected ways; I haven't decided yet. But we should definitely do it for 0.8, and add tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-785: - Labels: backport-needed (was: ) ClosureCleaner not invoked on most PairRDDFunctions --- Key: SPARK-785 URL: https://issues.apache.org/jira/browse/SPARK-785 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Sean Owen Labels: backport-needed Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0 It's pretty weird that we've missed this so far, but it seems to be the case. Unfortunately it may not be good to fix this in 0.7.3 because it could change behavior in unexpected ways; I haven't decided yet. But we should definitely do it for 0.8, and add tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-785: - Target Version/s: 1.2.1 Assignee: Sean Owen ClosureCleaner not invoked on most PairRDDFunctions --- Key: SPARK-785 URL: https://issues.apache.org/jira/browse/SPARK-785 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Sean Owen Labels: backport-needed Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0 It's pretty weird that we've missed this so far, but it seems to be the case. Unfortunately it may not be good to fix this in 0.7.3 because it could change behavior in unexpected ways; I haven't decided yet. But we should definitely do it for 0.8, and add tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4320: -- Fix Version/s: (was: 1.1.1) 1.1.2 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.2.0, 1.1.2 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works, though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing Spark jobs that go from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
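To pin the proposal down, the two overloads the reporter sketches would look roughly like this in Scala notation. This is only the shape of the suggested API, not an existing Spark method; for comparison, Scala's PairRDDFunctions already exposes saveAsNewAPIHadoopDataset(conf: Configuration), and the ask here is an equally convenient Java-friendly entry point:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job, OutputFormat}

trait ProposedJavaPairRDDSaves[K, V] {
  // Variant 1: supply the pieces directly and reuse one Configuration
  // across jobs that read from and write back to Hadoop.
  def saveAsNewHadoopDataset(
      keyClass: Class[K],
      valueClass: Class[V],
      outputFormatClass: Class[_ <: OutputFormat[K, V]],
      conf: Configuration): Unit

  // Variant 2: configure everything on a new-API Job object.
  def saveAsNewHadoopDataset(job: Job): Unit
}
{code}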
[jira] [Updated] (SPARK-4232) Truncate table does not work when specifying a table from a non-current database session
[ https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4232: -- Fix Version/s: (was: 1.1.1) 1.1.2 Truncate table does not work when specifying a table from a non-current database session -- Key: SPARK-4232 URL: https://issues.apache.org/jira/browse/SPARK-4232 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: shengli Priority: Minor Fix For: 1.2.0, 1.1.2 Currently, truncate table works fine when the table belongs to the current database session, but it does not work when the table is qualified with another database. What I mean is: assume we have two databases, default and dw, and a table named test_table in database dw. By default we log in with the default database session. So running: use dw; truncate table test_table [partitions...]; is OK. If I instead stay in the default database and run: use default; truncate table dw.test_table; it will throw an exception: Failed to parse: truncate table dw.test_table. line 1:17 missing EOF at '.' near 'dw'. It's a bug in parsing truncate table xxx. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4355: -- Fix Version/s: (was: 1.1.1) 1.1.2 OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0, 1.1.2 It happens when the mean on one side is zero. I will send a PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
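For reference, the count-weighted merge a correct implementation needs is mergedMean = (n1 * mean1 + n2 * mean2) / (n1 + n2); zero is a perfectly valid mean, so a zero-mean side with a nonzero count must still contribute its weight. A minimal sketch (not MultivariateOnlineSummarizer's actual code):
{code}
def mergeMean(n1: Long, mean1: Double, n2: Long, mean2: Double): Double =
  if (n1 + n2 == 0L) 0.0
  else (n1 * mean1 + n2 * mean2) / (n1 + n2)
{code}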
[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason
[ https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4265: -- Fix Version/s: (was: 1.1.1) 1.1.2 Better extensibility for TaskEndReason -- Key: SPARK-4265 URL: https://issues.apache.org/jira/browse/SPARK-4265 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor Labels: api-change Fix For: 1.1.2 Now all subclasses of TaskEndReason are case classes. As per the discussion in https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to extend them (such as adding or removing fields) without breaking compatibility for pattern matching. It would be better to change them to regular classes for extensibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
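A small sketch of the compatibility problem and of the regular-class alternative (class and field names here are simplified placeholders, not the real TaskEndReason hierarchy):
{code}
// As a case class, every user-side `case Lost(id) =>` bakes in the full
// constructor, so adding a field breaks those matches:
//   case class Lost(execId: String) extends TaskEndReason

// As a regular class with a hand-written extractor, fields can be added
// later without changing what unapply exposes:
class Lost(val execId: String, val reason: String)

object Lost {
  def unapply(l: Lost): Option[String] = Some(l.execId)
}

def describe(r: Any): String = r match {
  case Lost(id) => s"executor $id lost"
  case _        => "something else"
}
{code}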
[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-785: - Fix Version/s: (was: 1.1.1) 1.1.2 ClosureCleaner not invoked on most PairRDDFunctions --- Key: SPARK-785 URL: https://issues.apache.org/jira/browse/SPARK-785 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Sean Owen Labels: backport-needed Fix For: 0.9.3, 1.0.3, 1.3.0, 1.1.2 It's pretty weird that we've missed this so far, but it seems to be the case. Unfortunately it may not be good to fix this in 0.7.3 because it could change behavior in unexpected ways; I haven't decided yet. But we should definitely do it for 0.8, and add tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4006: -- Fix Version/s: (was: 1.1.1) 1.1.2 Spark Driver crashes whenever an Executor is registered twice - Key: SPARK-4006 URL: https://issues.apache.org/jira/browse/SPARK-4006 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0 Environment: Mesos, Coarse Grained Reporter: Tal Sliwowicz Assignee: Tal Sliwowicz Priority: Critical Fix For: 1.2.0, 1.1.2 This is a huge robustness issue for us (Taboola), in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, from time to time executors disconnect. In many cases, the RemoveExecutor is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. The issue is with the System.exit(1) in BlockManagerMasterActor:
{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
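A sketch of the more robust behavior this report argues for: treat a duplicate registration as evidence the old executor died, evict its stale block manager, and accept the new one instead of calling System.exit(1). Helper names such as removeExecutor mirror the surrounding actor, but this is an illustrative sketch, not the merged patch verbatim:
{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // Same executor id, different block manager: the executor was most
        // likely restarted. Drop the stale registration rather than dying.
        logError("Got two different block manager registrations on executor " +
          id.executorId + "; removing stale " + oldId)
        removeExecutor(id.executorId)
      case None => // first registration for this executor id
    }
    blockManagerIdByExecutor(id.executorId) = id
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}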
[jira] [Updated] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3987: -- Fix Version/s: (was: 1.1.2) 1.1.1 NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 54636.778527, -295133.248586, 
54177.278365, 0.00, -85237.666701, 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 148311.355507, 42231.381747, -12524.380088, 219025.288975, 203889.695169, -88370.612394, -24443.378205, -295133.248586,
[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4006: -- Fix Version/s: (was: 1.1.2) 1.1.1 Spark Driver crashes whenever an Executor is registered twice - Key: SPARK-4006 URL: https://issues.apache.org/jira/browse/SPARK-4006 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0 Environment: Mesos, Coarse Grained Reporter: Tal Sliwowicz Assignee: Tal Sliwowicz Priority: Critical Fix For: 1.1.1, 1.2.0 This is a huge robustness issue for us (Taboola), in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, from time to time executors disconnect. In many cases, the RemoveExecutor is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. The issue is with the System.exit(1) in BlockManagerMasterActor:
{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2823) GraphX jobs throw IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247480#comment-14247480 ] Josh Rosen commented on SPARK-2823: --- I've removed the Fix Versions from this JIRA because its fix was reverted. GraphX jobs throw IllegalArgumentException -- Key: SPARK-2823 URL: https://issues.apache.org/jira/browse/SPARK-2823 Project: Spark Issue Type: Bug Components: GraphX Reporter: Lu Lu If the users set "spark.default.parallelism" and the value is different from the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
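The underlying constraint is easy to reproduce outside GraphX (spark-shell sketch; `sc` is the shell's SparkContext):
{code}
val a = sc.parallelize(1 to 8, 4) // 4 partitions
val b = sc.parallelize(1 to 8, 8) // 8 partitions

// a.zip(b).count()
// => java.lang.IllegalArgumentException: Can't zip RDDs with unequal
//    numbers of partitions

val b2 = sc.parallelize(1 to 8, 4) // match the partition count
a.zip(b2).count()                  // fine: same count and per-partition sizes
{code}
GraphX hits the same check internally when spark.default.parallelism disagrees with the EdgeRDD's partition count.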
[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2823: -- Fix Version/s: (was: 1.1.2) (was: 1.0.3) (was: 1.2.0) GraphX jobs throw IllegalArgumentException -- Key: SPARK-2823 URL: https://issues.apache.org/jira/browse/SPARK-2823 Project: Spark Issue Type: Bug Components: GraphX Reporter: Lu Lu If the users set "spark.default.parallelism" and the value is different from the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2823: -- Target Version/s: 1.0.3, 1.3.0, 1.1.2, 1.2.1 GraphX jobs throw IllegalArgumentException -- Key: SPARK-2823 URL: https://issues.apache.org/jira/browse/SPARK-2823 Project: Spark Issue Type: Bug Components: GraphX Reporter: Lu Lu If the users set "spark.default.parallelism" and the value is different from the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4355: -- Fix Version/s: (was: 1.1.2) 1.1.1 OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.1, 1.2.0 It happens when the mean on one side is zero. I will send a PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
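For context, the summarizer in question is {{org.apache.spark.mllib.stat.MultivariateOnlineSummarizer}}. A small sketch of the merge path the bug affects (the input vectors are made up; in affected versions the merged mean could come out wrong when one side's mean is exactly zero):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// A summarizer whose mean is exactly zero in every dimension...
val left = new MultivariateOnlineSummarizer()
left.add(Vectors.dense(1.0, -1.0))
left.add(Vectors.dense(-1.0, 1.0))

// ...merged with a summarizer whose mean is non-zero.
val right = new MultivariateOnlineSummarizer()
right.add(Vectors.dense(3.0, 3.0))

// Expected merged mean over the three samples: (1.0, 1.0).
println(left.merge(right).mean)
{code}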
[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4320: -- Target Version/s: 1.1.2, 1.2.1 Fix Version/s: (was: 1.1.2) (was: 1.2.0) I've changed this issue's Fix Version/s into Target Version/s since it hasn't actually been fixed yet. JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works, though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
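A sketch of what the two proposed overloads might look like, written out in Scala (these signatures follow the reporter's wording and are not actual Spark API):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job, OutputFormat}

// Variant 1: the caller supplies the pieces directly and can reuse a
// single Configuration across jobs that go from Hadoop back into Hadoop.
def saveAsNewHadoopDataset[K, V](
    keyClass: Class[K],
    valueClass: Class[V],
    outputFormatClass: Class[_ <: OutputFormat[K, V]],
    conf: Configuration): Unit = ???

// Variant 2: the caller sets up the new-API Job object explicitly.
def saveAsNewHadoopDataset(job: Job): Unit = ???
{code}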
[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason
[ https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4265: -- Fix Version/s: (was: 1.1.2) Removing the Fix Version/s field, since we reserve that for versions where a fix was actually applied. Please use Target Version/s to plan future fixes. Better extensibility for TaskEndReason -- Key: SPARK-4265 URL: https://issues.apache.org/jira/browse/SPARK-4265 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor Labels: api-change Now all subclasses of TaskEndReason are case classes. As per discussion in https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to extend them (such as adding/removing fields) without breaking compatibility for pattern matching. It's better to change them to regular classes for extensibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
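A toy illustration of the compatibility problem (hypothetical class, not Spark's real TaskEndReason hierarchy): adding or removing a case-class field breaks every call site that pattern-matches on the full constructor, whereas a regular class with accessors can grow fields freely, at the cost of losing {{unapply}}.
{code}
// Hypothetical stand-in for a TaskEndReason subclass.
case class FetchFailed(host: String, shuffleId: Int)

def describe(reason: FetchFailed): String = reason match {
  // Matches the full constructor, so adding a field (e.g. a mapId)
  // would break this call site at compile time.
  case FetchFailed(host, shuffleId) =>
    s"fetch failed on $host for shuffle $shuffleId"
}
{code}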
[jira] [Updated] (SPARK-4151) Add string operation functions trim, ltrim, rtrim, length to support SparkSql (HiveQL)
[ https://issues.apache.org/jira/browse/SPARK-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4151: -- Target Version/s: 1.1.2 (was: 1.1.0) Fix Version/s: (was: 1.1.2) (was: 1.1.0) Removing the Fix Version/s field, since we only use that to indicate where fixes have been applied, not where we plan to apply them; moved those versions over to Target Version/s instead. Add string operation functions trim, ltrim, rtrim, length to support SparkSql (HiveQL) -- Key: SPARK-4151 URL: https://issues.apache.org/jira/browse/SPARK-4151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: shengli Priority: Minor Original Estimate: 72h Remaining Estimate: 72h Add string operation functions to support Spark SQL and HiveQL. e.g.: sql("select trim(' a b ') from src").collect() -- 'a b' sql("select ltrim(' a b ') from src").collect() -- 'a b ' sql("select rtrim(' a b ') from src").collect() -- ' a b' sql("select length('ab') from src").collect() -- 2 Also rename the trait in stringOperations.scala: I prefer to rename trait CaseConversionExpression to StringTransformationExpression; it makes more sense, since the trait can then support more string transformations than just case conversion. Also add a trait StringCalculationExpression that does string computations like length, indexOf, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
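To pin down the intended semantics independently of how the Catalyst expressions end up being implemented, here is what the four functions should compute, sketched with plain Scala string operations:
{code}
// Reference semantics only; the real implementations would be Catalyst
// expressions in stringOperations.scala.
def trim(s: String): String  = s.trim                     // " a b "  => "a b"
def ltrim(s: String): String = s.replaceAll("^\\s+", "")  // " a b "  => "a b "
def rtrim(s: String): String = s.replaceAll("\\s+$", "")  // " a b "  => " a b"
def length(s: String): Int   = s.length                   // "ab"     => 2
{code}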
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247487#comment-14247487 ] Joseph K. Bradley commented on SPARK-4846: -- I agree with [~srowen] that the current implementation has to serialize those big data structures, no matter what. Splitting the big syn0Global and syn1Global data structures across partitions sounds possible, but I would guess that the Word2VecModel itself would then need to be distributed as well since it occupies the same order of memory. A distributed Word2VecModel sounds like a much bigger PR. In the meantime, a simpler, faster solution might be nice. The easiest would be to catch the error and print a warning. A fancier but better solution might be to automatically increase minCount as much as necessary (and print a warning about this automatic change). When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Priority: Critical Exception in thread "Driver" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
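A back-of-the-envelope check of why a ~10M-word vocabulary blows up (numbers assumed: the default vectorSize of 100, and syn0Global/syn1Global being Float arrays of vocabSize * vectorSize elements that get pulled into the serialized closure):
{code}
val vocabSize  = 10000000L // ~10M words, per the reporter's environment
val vectorSize = 100L      // Word2Vec's default vector size
val bytesPerArray = vocabSize * vectorSize * 4L // Float = 4 bytes

// ~4 GB per array, ~8 GB for the pair. A single Java byte array is capped
// near Integer.MAX_VALUE elements, so the serializer's growing
// ByteArrayOutputStream hits "Requested array size exceeds VM limit"
// long before the heap itself is exhausted.
println(f"per array: ${bytesPerArray / 1e9}%.1f GB")
{code}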
[jira] [Updated] (SPARK-4232) Truncate table does not work when specifying the table from a non-current database session
[ https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4232: -- Target Version/s: 1.1.2, 1.2.1 (was: 1.1.0) Fix Version/s: (was: 1.1.2) (was: 1.2.0) Removing the Fix Version/s field, since we only use that to indicate where fixes have been applied, not where we plan to apply them; moved those versions over to Target Version/s instead. Truncate table does not work when specifying the table from a non-current database session -- Key: SPARK-4232 URL: https://issues.apache.org/jira/browse/SPARK-4232 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: shengli Priority: Minor Currently, truncate table works fine when the target table is in the current database session, but it does not work when the table belongs to a database other than the current one. Assume we have two databases, default and dw, and a table named test_table in database dw. By default we log in to the default database session, so running: use dw; truncate table test_table [partitions..]; is OK. But if we stay in the default database and run: use default; truncate table dw.test_table; it will throw an exception: Failed to parse: truncate table dw.test_table. line 1:17 missing EOF at '.' near 'dw' It's a bug in parsing truncate table xxx. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4323) Utils#fetchFile method should reliably close lock file
[ https://issues.apache.org/jira/browse/SPARK-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4323. --- Resolution: Not a Problem Resolving this as Not a Problem since the pull request was closed after some discussion. Utils#fetchFile method should reliably close lock file --- Key: SPARK-4323 URL: https://issues.apache.org/jira/browse/SPARK-4323 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Kousuke Saruta In the Utils#fetchFile method, the lock file is created as follows.
{code}
val raf = new RandomAccessFile(lockFile, "rw")
// Only one executor entry.
// The FileLock is only used to control synchronization for executors download file,
// it's always safe regardless of lock type (mandatory or advisory).
val lock = raf.getChannel().lock()
val cachedFile = new File(localDir, cachedFileName)
try {
  if (!cachedFile.exists()) {
    doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
  }
} finally {
  lock.release()
}
{code}
If some error occurs between opening the RandomAccessFile and acquiring the lock, the lock file may never be closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
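One possible shape of a fix, reusing the names from the snippet above (a sketch only; not necessarily the change that was proposed in the closed pull request): close the RandomAccessFile in its own finally block so it is released even if {{getChannel().lock()}} throws.
{code}
val raf = new RandomAccessFile(lockFile, "rw")
try {
  val lock = raf.getChannel().lock()
  val cachedFile = new File(localDir, cachedFileName)
  try {
    if (!cachedFile.exists()) {
      doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
    }
  } finally {
    lock.release()
  }
} finally {
  raf.close() // runs even if lock() or the fetch throws
}
{code}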
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4793: -- Labels: backport-needed (was: ) way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Labels: backport-needed Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247505#comment-14247505 ] Joseph K. Bradley commented on SPARK-2980: -- Duplicated by a later JIRA, which has been fixed Python support for chi-squared test --- Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-2980. Resolution: Duplicate Fix Version/s: 1.2.0 Assignee: Davies Liu Python support for chi-squared test --- Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Assignee: Davies Liu Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4855) Python tests for hypothesis testing
Joseph K. Bradley created SPARK-4855: Summary: Python tests for hypothesis testing Key: SPARK-4855 URL: https://issues.apache.org/jira/browse/SPARK-4855 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Ben Cook Priority: Minor Add Python unit tests for Chi-Squared hypothesis testing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
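The Scala entry point these Python tests would mirror is {{Statistics.chiSqTest}} in MLlib; a minimal goodness-of-fit example (observed counts made up, tested against the default uniform expected distribution):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observed = Vectors.dense(4.0, 6.0, 5.0)
// Pearson's chi-squared goodness-of-fit test against a uniform distribution.
val result = Statistics.chiSqTest(observed)
println(result.pValue)
{code}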
[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4814: -- Fix Version/s: 1.2.1 1.1.2 1.3.0 Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger Key: SPARK-4814 URL: https://issues.apache.org/jira/browse/SPARK-4814 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Sean Owen Fix For: 1.3.0, 1.1.2, 1.2.1 Follow-up to SPARK-4159, wherein we noticed that Java tests weren't running in Maven, in part because a Java test actually fails with {{AssertionError}}. That code/test was fixed in SPARK-4850. The reason it wasn't caught by SBT tests was that they don't run with assertions on, and Maven's surefire does. Turning on assertions in the SBT build is trivial, adding one line: {code} javaOptions in Test += "-ea", {code} This reveals a test failure in Scala test suites though: {code} [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) [info] Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, localhost): java.lang.AssertionError [info] at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) [info] at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info] at org.apache.spark.scheduler.Task.run(Task.scala:56) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} The items for this JIRA are therefore: - Enable assertions in SBT - Fix this failure - Figure out why Maven scalatest didn't trigger it - may need assertions explicitly turned on too.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4814: -- Target Version/s: 1.0.3 (was: 1.3.0) Assignee: Sean Owen Labels: backport-needed (was: ) Alright, I've merged Sean's PR into master, branch-1.2, and branch-1.1. There's a large-ish merge conflict that we'll have to fix to get this into branch-1.0 (or I can just manually fix things up there). Tagging this as {{backport-needed}} so I remember to come back and do that. Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger Key: SPARK-4814 URL: https://issues.apache.org/jira/browse/SPARK-4814 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Sean Owen Assignee: Sean Owen Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.1 Follow-up to SPARK-4159, wherein we noticed that Java tests weren't running in Maven, in part because a Java test actually fails with {{AssertionError}}. That code/test was fixed in SPARK-4850. The reason it wasn't caught by SBT tests was that they don't run with assertions on, and Maven's surefire does. Turning on assertions in the SBT build is trivial, adding one line: {code} javaOptions in Test += "-ea", {code} This reveals a test failure in Scala test suites though: {code} [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) [info] Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, localhost): java.lang.AssertionError [info] at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) [info] at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info] at org.apache.spark.scheduler.Task.run(Task.scala:56) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} The items for this JIRA are therefore: - Enable assertions in SBT - Fix this failure - Figure out why Maven scalatest didn't trigger it - may need assertions explicitly turned on too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4856) Null/empty string should not be considered as StringType at the beginning of JSON schema inference
Cheng Hao created SPARK-4856: - Summary: Null/empty string should not be considered as StringType at the beginning of JSON schema inference Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {panel} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {panel} Since the empty string is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type in line 1) is ignored, and the headers in line 1 are also taken as String Type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null/empty string should not be considered as StringType at the beginning of JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {panel} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {panel} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. was: We have data like: {panel} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {panel} Since the empty string is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type in line 1) is ignored, and the headers in line 1 are also taken as String Type. Null/empty string should not be considered as StringType at the beginning of JSON schema inference --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {panel} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {panel} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null/empty string should not be considered as StringType at the beginning of JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {code:java} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {code} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. was: We have data like: {panel} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {panel} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. Null/empty string should not be considered as StringType at the beginning of JSON schema inference --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {code:java} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {code} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4856) Null/empty string should not be considered as StringType at the beginning of JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247571#comment-14247571 ] Apache Spark commented on SPARK-4856: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3708 Null/empty string should not be considered as StringType at the beginning of JSON schema inference --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {code:java} TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) {code} Since the empty string (the headers) is considered as String at the beginning (in lines 2 and 3), the real nested data type (the struct type of headers in line 1) is ignored, and the headers in line 1 are also taken as String Type, which is not what we expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
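A sketch of the widening rule the ticket is asking for (hypothetical helper, not the actual JsonRDD code; the import path matches the 1.2-era codebase, where the DataType classes live under catalyst.types): a null/empty sample should defer to the type seen in other records instead of pinning the field to StringType.
{code}
import org.apache.spark.sql.catalyst.types._

def mergeSampledTypes(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (NullType, other) => other       // "" / null defers to the real type
  case (other, NullType) => other
  case (a, b) if a == b  => a
  case _                 => StringType  // genuine conflicts still fall back
}
{code}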
[jira] [Commented] (SPARK-4857) Add Executor Events to SparkListener
[ https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247578#comment-14247578 ] Kostas Sakellis commented on SPARK-4857: I'll work on this. Add Executor Events to SparkListener Key: SPARK-4857 URL: https://issues.apache.org/jira/browse/SPARK-4857 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis We need to add events to the SparkListener to indicate an executor has been added or removed with corresponding information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4857) Add Executor Events to SparkListener
Kostas Sakellis created SPARK-4857: -- Summary: Add Executor Events to SparkListener Key: SPARK-4857 URL: https://issues.apache.org/jira/browse/SPARK-4857 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis We need to add events to the SparkListener to indicate an executor has been added or removed with corresponding information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
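A sketch of what the addition might look like (the names here are hypothetical, not Spark API at the time of this ticket): two new event types plus matching no-op callbacks, so existing listeners keep compiling.
{code}
case class SparkListenerExecutorAdded(
    time: Long, executorId: String, host: String, totalCores: Int)
case class SparkListenerExecutorRemoved(
    time: Long, executorId: String, reason: String)

// Default no-op implementations preserve source compatibility for
// listeners that don't care about executor lifecycle.
trait ExecutorEvents {
  def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {}
  def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {}
}
{code}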
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247582#comment-14247582 ] Joseph K. Bradley commented on SPARK-3702: -- I'm canceling my WIP PR for this since I have begun breaking that PR into smaller PRs. The WIP PR branch is in [my ml-api branch | https://github.com/jkbradley/spark/tree/ml-api]. Here's the description of the WIP PR: This is WIP effort to standardize abstractions and developer API for prediction tasks (classification and regression) for the new ML api (org.apache.spark.ml). * Please comment on: ** abstractions, class hierarchy ** functionality required by each abstraction ** naming of types and methods ** ease of use for developers ** ease of use for users migrating from org.apache.spark.mllib * Please ignore for now: ** missing tests and examples ** private/public API (I will make more things private to ml after writing tests and examples.) ** style and other details ** the many TODO items noted in the code Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some discussion on design, and [this design doc | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] for major design decisions. This is not intended to cover all algorithms; e.g., one big missing item is porting the GeneralizedLinearModel class to the new API. But it hopefully lays a fair amount of groundwork. I have included a limited number of concrete classes in this WIP PR, for purposes of illustration: * LogisticRegression (edited, to show effects of abstract classes) * NaiveBayes (simple to show ease of use for developers) * AdaBoost (demonstration of meta-algorithms taking advantage of abstractions) ** (Note discussion of strong vs. weak types for ensemble methods in design doc.) ** This implementation is very incomplete but illustrates using the abstractions. * LinearRegression (example of Regressor, for completeness) * evaluators (to provide default evaluators in the class hierarchy) * IterativeSolver and IterativeEstimator (to expose iterative algorithms) * LabeledPoint (Q: Should this include an instance weight?) Items remaining: - [ ] helper method for simulating a distribution over weighted instances by subsampling (for algorithms which do not support instance weights) - [ ] several TODO items noted in the code - [ ] add tests and examples - [ ] general cleanup - [ ] make more of hierarchy private to ml - [ ] split into several smaller PRs General plan for splitting into multiple PRs, in order: 1. Simple class hierarchy 2. Evaluators 3. IterativeEstimator 4. AdaBoost 5. NaiveBayes (Any time after Evaluators) Thanks to @epahomov and @BigCrunsh for input, including from [https://github.com/apache/spark/pull/2137] which improves upon the org.apache.spark.mllib APIs. Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. 
Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4838) StackOverflowError when serializing task
[ https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247629#comment-14247629 ] Hong Shen commented on SPARK-4838: -- Here is the SQL; it contains 2928 partitions. {code:title=Bar.java|borderStyle=solid} SELECT DISTINCT a.uin FROM ( SELECT DISTINCT uin AS uin FROM hlw :: t_dw_pf00135 a1 WHERE imp_date BETWEEN 2014081008 AND 2014121007 AND ( oper1 LIKE '%设置气泡%' AND oper2 '0' ) ) a {code} Here is the full stack. Error message from Spark is: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
[jira] [Commented] (SPARK-4844) SGD should support custom sampling.
[ https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247652#comment-14247652 ] Guoqiang Li commented on SPARK-4844: Sorry, I mean that all of the data needs to be serialized in {{RDD.sample}}; it is very inefficient. SGD should support custom sampling. --- Key: SPARK-4844 URL: https://issues.apache.org/jira/browse/SPARK-4844 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
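For context, the sampling under discussion is the mini-batch draw inside MLlib's gradient descent, which calls {{RDD.sample}} once per iteration; a simplified sketch of the pattern (the {{data}} parameter stands in for the training RDD):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Each iteration samples a mini-batch. RDD.sample scans (and therefore
// deserializes) every partition even when the sampled fraction is small,
// which is the inefficiency being pointed out here.
def sgdIterations(data: RDD[LabeledPoint],
                  numIterations: Int,
                  miniBatchFraction: Double): Unit = {
  for (i <- 1 to numIterations) {
    val miniBatch = data.sample(false, miniBatchFraction, 42 + i)
    // compute the (sub)gradient on miniBatch and update the weights ...
  }
}
{code}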
[jira] [Commented] (SPARK-4838) StackOverflowError when serializing task
[ https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247662#comment-14247662 ] Sean Owen commented on SPARK-4838: -- [~shenhong] This stack trace is very large but does not show any new information. What I meant was: is there anything different at the root? Or was it not present in your logs? Obviously the problem is a serialization graph that is way too deeply nested, so more copies of these 5 lines aren't needed, but it might help to show where the call originated. StackOverflowError when serializing task -- Key: SPARK-4838 URL: https://issues.apache.org/jira/browse/SPARK-4838 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Hong Shen When running a SQL query with more than 2000 partitions, each partition a HadoopRDD, it causes java.lang.StackOverflowError when serializing the task. Error message from Spark is: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4855) Python tests for hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247665#comment-14247665 ] Apache Spark commented on SPARK-4855: - User 'jbencook' has created a pull request for this issue: https://github.com/apache/spark/pull/3679 Python tests for hypothesis testing --- Key: SPARK-4855 URL: https://issues.apache.org/jira/browse/SPARK-4855 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Ben Cook Priority: Minor Add Python unit tests for Chi-Squared hypothesis testing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4838) StackOverflowError when serializing task
[ https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247676#comment-14247676 ] Hong Shen commented on SPARK-4838: -- This is the whole stack. All we can know is that it is thrown from DAGScheduler.submitMissingTasks, when serializing stage.rdd.
{code}
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] =
    if (stage.isShuffleMap) {
      closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
    } else {
      closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
    }
  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString)
    runningStages -= stage
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
    runningStages -= stage
    return
}
{code}
StackOverflowError when serializing task -- Key: SPARK-4838 URL: https://issues.apache.org/jira/browse/SPARK-4838 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Hong Shen When running a SQL query with more than 2000 partitions, each partition a HadoopRDD, it causes java.lang.StackOverflowError when serializing the task. Error message from Spark is: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
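A sketch of the shape of the problem (illustrative, not the reporter's actual plan): unioning thousands of RDDs pairwise builds a lineage whose depth equals the number of inputs, and Java serialization of {{stage.rdd}} recurses once per level, overflowing the stack.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pairwise union: nesting depth ~ parts.size, so serializing the result
// recurses thousands of levels for thousands of inputs.
def deepUnion(parts: Seq[RDD[String]]): RDD[String] =
  parts.reduceLeft(_ union _)

// A shallower alternative: one UnionRDD with many parents.
def flatUnion(sc: SparkContext, parts: Seq[RDD[String]]): RDD[String] =
  sc.union(parts)
{code}
When the plan cannot be flattened, the usual workaround is a larger driver thread stack (e.g. an -Xss setting via spark.driver.extraJavaOptions).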
[jira] [Created] (SPARK-4858) Add an option to turn off a progress bar in spark-shell
Takeshi Yamamuro created SPARK-4858: --- Summary: Add an option to turn off a progress bar in spark-shell Key: SPARK-4858 URL: https://issues.apache.org/jira/browse/SPARK-4858 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Takeshi Yamamuro Priority: Minor Add a '--no-progress-bar' option to easily turn off the progress bar in spark-shell, for users who'd like to look into debug logs or the like. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4858) Add an option to turn off a progress bar in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247722#comment-14247722 ] Apache Spark commented on SPARK-4858: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3709 Add an option to turn off a progress bar in spark-shell --- Key: SPARK-4858 URL: https://issues.apache.org/jira/browse/SPARK-4858 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Takeshi Yamamuro Priority: Minor Add a '--no-progress-bar' option to easily turn off the progress bar in spark-shell, for users who'd like to look into debug logs or the like. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
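For what it's worth, the console progress bar introduced in 1.2 is already gated by a configuration property, so the requested effect can likely be approximated without a new CLI flag (property name per the 1.2-era codebase, where it is undocumented):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.ui.showConsoleProgress", "false")
// or at launch time:
//   spark-shell --conf spark.ui.showConsoleProgress=false
{code}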
[jira] [Created] (SPARK-4859) Improve StreamingListenerBus
Shixiong Zhu created SPARK-4859: --- Summary: Improve StreamingListenerBus Key: SPARK-4859 URL: https://issues.apache.org/jira/browse/SPARK-4859 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Priority: Minor Fix the race condition of `queueFullErrorMessageLogged`. Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
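A sketch of the second point, isolating listener failures (hypothetical shape, not the actual StreamingListenerBus code): dispatch to each listener inside try/catch so one misbehaving listener is logged rather than killing the bus thread.
{code}
import org.slf4j.LoggerFactory

object SafeDispatch {
  private val log = LoggerFactory.getLogger(getClass)

  // Apply `call` to every listener; log per-listener failures and keep going.
  def postToAll[L, E](listeners: Seq[L], event: E)(call: (L, E) => Unit): Unit = {
    listeners.foreach { listener =>
      try call(listener, event)
      catch {
        case t: Throwable =>
          log.error(s"Listener ${listener.getClass.getName} threw an exception", t)
      }
    }
  }
}
{code}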
[jira] [Commented] (SPARK-4859) Improve StreamingListenerBus
[ https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247801#comment-14247801 ] Apache Spark commented on SPARK-4859: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3710 Improve StreamingListenerBus Key: SPARK-4859 URL: https://issues.apache.org/jira/browse/SPARK-4859 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Priority: Minor Fix the race condition of `queueFullErrorMessageLogged`. Log the error from listener rather than crashing `listenerThread`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4792) Add some checks and messages on making local dir
[ https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4792. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3635 [https://github.com/apache/spark/pull/3635] Add some checks and messages on making local dir Key: SPARK-4792 URL: https://issues.apache.org/jira/browse/SPARK-4792 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor Fix For: 1.3.0 There is currently no check of whether the local dir was created successfully; if creation fails, an error/warning message should be logged. Also, the code should check whether the dir already exists before creating it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4792) Add some checks and messages on making local dir
[ https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4792: -- Assignee: meiyoula Add some checks and messages on making local dir Key: SPARK-4792 URL: https://issues.apache.org/jira/browse/SPARK-4792 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Assignee: meiyoula Priority: Minor Fix For: 1.3.0 There is currently no check of whether the local dir was created successfully; if creation fails, an error/warning message should be logged. Also, the code should check whether the dir already exists before creating it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip
[ https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4841: -- Fix Version/s: 1.3.0 Labels: backport-needed (was: ) I've merged https://github.com/apache/spark/pull/3706 into master; adding a {{backport-needed}} label so that this makes it into 1.2.1. Batch serializer bug in PySpark's RDD.zip - Key: SPARK-4841 URL: https://issues.apache.org/jira/browse/SPARK-4841 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Blocker Labels: backport-needed Fix For: 1.3.0 {code} t = sc.textFile("README.md") t.zip(t).count() {code} {code} Py4JJavaError Traceback (most recent call last) <ipython-input-6-60fdeb8339fd> in <module>() 1 readme.zip(readme).count() /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self) 817 3 818 --> 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 820 821 def stats(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self) 808 6.0 809 --> 810 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) 811 812 def count(self): /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f) 713 yield reduce(f, iterator, initial) 714 --> 715 vals = self.mapPartitions(func).collect() 716 if vals: 717 return reduce(f, vals) /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self) 674 675 with SCCallSiteSync(self.context) as css: --> 676 bytesInJava = self._jrdd.collect().iterator() 677 return list(self._collect_iterator_through_file(bytesInJava)) 678 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o69.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/meng/src/spark/python/pyspark/worker.py", line 107, in main process() File "/Users/meng/src/spark/python/pyspark/worker.py", line 98, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/meng/src/spark/python/pyspark/serializers.py", line 198, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/meng/src/spark/python/pyspark/serializers.py", line 81, in dump_stream raise NotImplementedError NotImplementedError at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137) at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at
[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch allocation, and cleanup with write ahead log
[ https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247900#comment-14247900 ] Josh Rosen commented on SPARK-4790: --- /cc [~hshreedharan], you might want to take a look at this issue, too. This test seems to fail intermittently on Jenkins. Since this is a new test, I think we should fix its flakiness in its own PR. Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch allocation, and cleanup with write ahead log -- Key: SPARK-4790 URL: https://issues.apache.org/jira/browse/SPARK-4790 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Assignee: Tathagata Das Labels: flaky-test Found another flaky streaming test, org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block to batch allocation and cleanup with write ahead log: {code} Error Message File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. Stacktrace sbt.ForkMain$ForkError: File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324) at org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344) at org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248) at org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173) at org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) at org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at 
org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at