[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285960#comment-14285960 ] Apache Spark commented on SPARK-5207: - User 'ogeagla' has created a pull request for this issue: https://github.com/apache/spark/pull/4140 StandardScalerModel mean and variance re-use Key: SPARK-5207 URL: https://issues.apache.org/jira/browse/SPARK-5207 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla From this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html Changing the constructor to public would be a simple change, but a discussion is needed to determine what args are necessary for this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
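The issue asks for a way to re-use a fitted scaler's statistics without re-fitting. A minimal plain-Scala sketch of what that enables; the names `standardize`, `savedMean`, and `savedStd` are illustrative only, not the StandardScalerModel API under discussion:

```scala
// Apply mean/std saved from a previously fitted model to new data.
// (Hypothetical helper, not Spark API.)
def standardize(x: Array[Double], mean: Array[Double], std: Array[Double]): Array[Double] =
  x.indices.map(i => if (std(i) == 0.0) 0.0 else (x(i) - mean(i)) / std(i)).toArray

// Statistics "saved" from an earlier fit (made-up values).
val savedMean = Array(2.0, 10.0)
val savedStd  = Array(1.0, 5.0)
val scaled = standardize(Array(3.0, 20.0), savedMean, savedStd)
```

A public constructor taking mean and variance vectors would let users build exactly this kind of serving-time scaler from persisted statistics.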
[jira] [Resolved] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5301. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4089 [https://github.com/apache/spark/pull/4089] Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Fix For: 1.3.0 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
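Both utilities are simple at the entry level. A plain-Scala sketch, assuming a CoordinateMatrix is essentially a collection of (row, col, value) entries; `Entry`, `transpose`, and `rowToEntries` are illustrative names, not the MLlib API:

```scala
case class Entry(i: Long, j: Long, value: Double)

// 1) Transpose: swap row and column indices, one pass over the entries
//    (which is why it is cheap to compute).
def transpose(entries: Seq[Entry]): Seq[Entry] =
  entries.map(e => Entry(e.j, e.i, e.value))

// 2) Indexed row -> coordinate entries: explode a (rowIndex, vector)
//    pair into one entry per non-zero element.
def rowToEntries(rowIndex: Long, vector: Array[Double]): Seq[Entry] =
  vector.zipWithIndex.collect {
    case (v, j) if v != 0.0 => Entry(rowIndex, j.toLong, v)
  }.toSeq
```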
[jira] [Created] (SPARK-5352) Add getPartitionStrategy in Graph
Takeshi Yamamuro created SPARK-5352: --- Summary: Add getPartitionStrategy in Graph Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in paritionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modifiy (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getParitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286069#comment-14286069 ] Tobias Schlatter commented on SPARK-2620: - It is a hack in an attempt to have anonymous functions (and other classes) close over REPL state and send them over the wire (rather than re-running the static initialization code on the executor). It's up to you if you want to take the [red pill|https://github.com/apache/spark/blob/b6cf1348170951396a6a5d8a65fb670382304f5b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L100]. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Assignee: Tobias Schlatter Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast, the expected behavior should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
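Outside the REPL, case classes get structural equals/hashCode, so keying on them behaves as the reporter expected; the bug is specific to how the shell wraps and loads REPL-defined classes. A local (non-Spark) sketch of the expected semantics:

```scala
// A case class gets equals/hashCode derived from its fields,
// so two P("bob") instances compare equal and hash identically.
case class P(name: String)

val ps = Seq(P("alice"), P("bob"), P("charly"), P("bob"))
// Grouping on the case class itself merges the duplicate keys,
// which is what reduceByKey should also do.
val counts = ps.groupBy(identity).map { case (k, v) => (k, v.size) }
```

In the REPL bug, the two P("bob") keys end up in distinct partitions with class instances loaded by different class loaders, so their hash codes disagree and they are never merged.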
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Target Version/s: 1.3.0 (was: 1.3.0, 1.1.2, 1.2.1) way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Fix Version/s: (was: 1.2.0) 1.2.1 Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.1 Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data and merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5355) SparkConf is not thread-safe
Davies Liu created SPARK-5355: - Summary: SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker SparkConf is not thread-safe, but it is accessed by many threads. getAll() could return only part of the configs if another thread is modifying it at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
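One possible fix direction is to back the settings with a concurrent map, so concurrent set() calls are safe and getAll() iterates over a consistent view instead of failing midway. A minimal sketch (illustrative only, not the actual SparkConf patch; `SafeConf` is a hypothetical name):

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical thread-safe config: ConcurrentHashMap gives atomic
// put/get and a weakly consistent iterator for snapshots.
class SafeConf {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): this.type = {
    settings.put(key, value); this
  }

  def get(key: String): Option[String] = Option(settings.get(key))

  // Snapshot all entries without Java<->Scala converter dependencies.
  def getAll: Array[(String, String)] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
    val it = settings.entrySet.iterator
    while (it.hasNext) {
      val e = it.next()
      buf += ((e.getKey, e.getValue))
    }
    buf.toArray
  }
}
```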
[jira] [Commented] (SPARK-5352) Add getPartitionStrategy in Graph
[ https://issues.apache.org/jira/browse/SPARK-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285898#comment-14285898 ] Apache Spark commented on SPARK-5352: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4138 Add getPartitionStrategy in Graph - Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5346: -- Description: When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from the web UI. When filter pushdown is enabled, the second query reads less data. Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkConf.hadoopConfiguration.set()}}); setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
was: When computing Parquet splits, reading Parquet metadata from executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To workaround this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from driver side eats lots of memory. The following Spark shell snippet can be useful to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from web UI. When filter pushdown is enabled, the second query reads fewer data. Notice that {{parquet.task.side.metadata}} must be set in _Hadoop_ configuration files (e.g. core-site.xml), setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Critical When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue: {code} import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ case class KeyValue(key: Int, value: String) sc. parallelize(1 to 1024). flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))). saveAsParquetFile("large.parquet") parquetFile("large.parquet").registerTempTable("large") sql("SET spark.sql.parquet.filterPushdown=true")
[jira] [Closed] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4215. Resolution: Fixed Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Allow requesting executors only on Yarn (for now) - Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
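The proposed behavior is to fail loudly instead of silently when executor requests are made on a non-YARN master. A minimal sketch of such a guard; the method name and message are illustrative, not the actual Spark code:

```scala
// Hypothetical guard: reject dynamic executor requests unless the
// master is a YARN deployment (e.g. "yarn-client", "yarn-cluster").
def checkExecutorRequestSupport(master: String): Unit = {
  if (!master.startsWith("yarn")) {
    throw new UnsupportedOperationException(
      s"Requesting executors is currently only supported on YARN (master was '$master')")
  }
}
```

An alternative, also floated in the issue, is to merely log a warning; throwing makes the limitation impossible to miss.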
[jira] [Closed] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4759. Resolution: Fixed Fix Version/s: 1.2.0 Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.0 Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data and merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285980#comment-14285980 ] RJ Nowling commented on SPARK-4894: --- [~mengxr] Since [~lmcguire] has submitted the patch, can we assign the JIRA to her so she gets credit for it? Thanks! Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4569) Rename externalSorting in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4569. Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Assignee: Ilya Ganelin Rename externalSorting in Aggregator -- Key: SPARK-4569 URL: https://issues.apache.org/jira/browse/SPARK-4569 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Ilya Ganelin Priority: Trivial Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.1 While technically all spilling in Spark does result in sorting, calling this variable externalSorting makes it seem like ExternalSorter will be used, when in fact it just means whether spilling is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286153#comment-14286153 ] Apache Spark commented on SPARK-2669: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4142 Hadoop configuration is not localised when submitting job in yarn-cluster mode -- Key: SPARK-2669 URL: https://issues.apache.org/jira/browse/SPARK-2669 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Maxim Ivanov I'd like to propose a fix for a problem where the Hadoop configuration is not localized when a job is submitted in yarn-cluster mode. Here is a description from github pull request https://github.com/apache/spark/pull/1574 This patch fixes a problem where, when the Spark driver runs in a container managed by the YARN ResourceManager, it inherits its configuration from the NodeManager process, which can be different from the Hadoop configuration present on the client (submitting machine). The problem is most visible when the fs.defaultFS property differs between these two. Hadoop MR solves it by serializing the client's Hadoop configuration into job.xml in the application staging directory and then making the Application Master use it. That guarantees that, regardless of the execution nodes' configurations, all application containers use the same config, identical to the one on the client side. This patch uses a similar approach. YARN ClientBase serializes the configuration and adds it to ClientDistributedCacheManager under the job.xml link name. ClientDistributedCacheManager then utilizes the Hadoop localizer to deliver it to whatever container is started by this application, including the one running the Spark driver.
YARN ClientBase also adds a SPARK_LOCAL_HADOOPCONF env variable to the AM container request, which is then used by SparkHadoopUtil.newConfiguration to trigger the new behavior, in which the machine-wide Hadoop configuration is merged with the application-specific job.xml (exactly as it is done in Hadoop MR). SparkContext then follows the same approach, adding the SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use the client-side Hadoop configuration. Also, all the references to new Configuration() which might be executed on the YARN cluster side are changed to use SparkHadoopUtil.get.conf Please note that it fixes only core Spark, the part which I am comfortable to test and verify the result. I didn't descend into the streaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4337) Add ability to cancel pending requests to YARN
[ https://issues.apache.org/jira/browse/SPARK-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286142#comment-14286142 ] Apache Spark commented on SPARK-4337: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4141 Add ability to cancel pending requests to YARN -- Key: SPARK-4337 URL: https://issues.apache.org/jira/browse/SPARK-4337 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza This will be useful for things like SPARK-4136 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Labels: (was: backport-needed) way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5301: - Assignee: Reza Zadeh Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.3.0 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4215: - Labels: (was: backport-needed) Allow requesting executors only on Yarn (for now) - Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5353) Log failures in ExecutorClassLoader
[ https://issues.apache.org/jira/browse/SPARK-5353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285901#comment-14285901 ] Apache Spark commented on SPARK-5353: - User 'gzm0' has created a pull request for this issue: https://github.com/apache/spark/pull/4130 Log failures in ExecutorClassLoader --- Key: SPARK-5353 URL: https://issues.apache.org/jira/browse/SPARK-5353 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: Tobias Schlatter Priority: Minor When the ExecutorClassLoader tries to load classes compiled in the Spark Shell and fails, it silently passes loading to the parent ClassLoader. It should log these failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5346: -- Priority: Blocker (was: Critical) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Blocker When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from the web UI. When filter pushdown is enabled, the second query reads less data.
Notice that {{parquet.task.side.metadata}} must be set in _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkConf.hadoopConfiguration.set()}}), setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
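Concretely, the workaround described above can be expressed as a Hadoop configuration fragment. A sketch of the setting (placement of the file depends on your Hadoop setup); this disables the memory-efficient task-side metadata path, so treat it as a workaround rather than a recommended default:

```
<!-- core-site.xml (Hadoop configuration, NOT spark-defaults.conf):
     read Parquet metadata on the driver side so that Parquet filter
     pushdown takes effect, per the workaround in this issue. -->
<property>
  <name>parquet.task.side.metadata</name>
  <value>false</value>
</property>
```

The same effect can be achieved programmatically by setting the key on the SparkContext's Hadoop configuration before the query runs, as noted in the description.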
[jira] [Closed] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4793. Resolution: Fixed way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4749) Allow initializing KMeans clusters using a seed
[ https://issues.apache.org/jira/browse/SPARK-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4749. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3610 [https://github.com/apache/spark/pull/3610] Allow initializing KMeans clusters using a seed --- Key: SPARK-4749 URL: https://issues.apache.org/jira/browse/SPARK-4749 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.1.0 Reporter: Nate Crosswhite Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Add an optional seed to MLlib KMeans clustering to allow initial cluster choices to be deterministic. Update the pyspark mllib interface to also allow an optional seed parameter to be supplied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
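What a seed buys you, in plain Scala: the same seed yields the same "random" choice of initial centers, so runs are reproducible. This is an illustration of the idea only, not the MLlib k-means code; `pickInitialCenters` is a hypothetical helper:

```scala
// Seeded selection of k initial centers from a point set.
// Identical seeds produce identical selections.
def pickInitialCenters(points: Seq[(Double, Double)], k: Int, seed: Long): Seq[(Double, Double)] = {
  val rng = new scala.util.Random(seed)
  rng.shuffle(points).take(k)
}
```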
[jira] [Created] (SPARK-5354) When possible, correctly set outputPartitioning for leaf SparkPlans
Yin Huai created SPARK-5354: --- Summary: When possible, correctly set outputPartitioning for leaf SparkPlans Key: SPARK-5354 URL: https://issues.apache.org/jira/browse/SPARK-5354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Right now, Spark SQL is not aware of the partitioning scheme of a leaf SparkPlan (e.g. an input table). So, even if users re-partition the data in advance, Exchange operators will still be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5352) Add getPartitionStrategy in Graph
[ https://issues.apache.org/jira/browse/SPARK-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5352: Description: Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) was: Graph remembers an applied partition strategy in paritionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation; val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modifiy (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getParitionStrategy, 2) Add getPartitionStrategy in Graph - Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285899#comment-14285899 ] Apache Spark commented on SPARK-5309: - User 'MickDavies' has created a pull request for this issue: https://github.com/apache/spark/pull/4139 Reduce Binary/String conversion overhead when reading/writing Parquet files --- Key: SPARK-5309 URL: https://issues.apache.org/jira/browse/SPARK-5309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Michael Davies Priority: Minor Converting between Parquet Binary and Java Strings can form a significant proportion of query times. For columns which have repeated String values (which is common), the same Binary is repeatedly converted. A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping on a data set of 66M rows on a column with many repeated Strings. A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet so that it could ensure that this was done only once per Binary value. The next step is to look at the Parquet code and discuss this with that project, which I will do. More details are available on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
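The caching idea described above is easy to sketch: keep the last raw value and its decoded String per column, and only decode on a miss. A toy model with illustrative names (the real change lives in Spark's Parquet converters):

```python
class CachedStringConverter:
    """Per-column cache of the most recently decoded value. Hits occur on
    consecutive repeats, which is the common case for sorted or clustered
    string columns."""
    def __init__(self):
        self.last_raw = None
        self.last_str = None
        self.decodes = 0  # count of actual UTF-8 decodes, for illustration

    def convert(self, raw_bytes):
        if raw_bytes != self.last_raw:
            self.last_raw = raw_bytes
            self.last_str = raw_bytes.decode("utf-8")
            self.decodes += 1
        return self.last_str

conv = CachedStringConverter()
values = [b"us", b"us", b"us", b"uk", b"uk", b"us"]
decoded = [conv.convert(v) for v in values]
# Six values pass through, but only three actual decodes are performed.
```

Note the cache only helps for consecutive repeats; the fuller optimisation mentioned in the issue (letting Parquet decode each distinct Binary once) would also cover interleaved values.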
[jira] [Updated] (SPARK-4569) Rename externalSorting in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4569: -- Labels: (was: backport-needed) It looks like all backports have been completed, so I'm removing the {{backport-needed}} label. Rename externalSorting in Aggregator -- Key: SPARK-4569 URL: https://issues.apache.org/jira/browse/SPARK-4569 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Ilya Ganelin Priority: Trivial Fix For: 1.3.0, 1.1.2, 1.2.1 While technically all spilling in Spark does result in sorting, calling this variable externalSorting makes it seem like ExternalSorter will be used, when in fact it just means whether spilling is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1714. -- Resolution: Fixed Fix Version/s: 1.3.0 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5349) Multiple spark shells should be able to share resources
[ https://issues.apache.org/jira/browse/SPARK-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Bertelsen updated SPARK-5349: Description: The resource requirements of an interactive shell vary heavily. Sometimes heavy commands are executed, and sometimes the user is thinking, getting coffee, interrupted, etc. A Spark shell allocates a fixed number of worker cores (at least in standalone mode). A user thus has the choice to either block other users from the cluster by allocating all cores (the default behavior), or restrict him/herself to only a few cores using the option {{--total-executor-cores}}. Either way, the cores allocated to the shell have low utilization, since they will be waiting for the user a lot. Instead, the Spark shell should allocate only the resources required to run the driver, and request worker cores only when computation is performed on RDDs. This should allow multiple users to use interactive shells concurrently while still utilizing the entire cluster when performing heavy operations. was: The documentation states Multiple spark shells should be able to share resources --- Key: SPARK-5349 URL: https://issues.apache.org/jira/browse/SPARK-5349 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Tobias Bertelsen The resource requirements of an interactive shell vary heavily. Sometimes heavy commands are executed, and sometimes the user is thinking, getting coffee, interrupted, etc. A Spark shell allocates a fixed number of worker cores (at least in standalone mode). A user thus has the choice to either block other users from the cluster by allocating all cores (the default behavior), or restrict him/herself to only a few cores using the option {{--total-executor-cores}}. Either way, the cores allocated to the shell have low utilization, since they will be waiting for the user a lot.
Instead, the Spark shell should allocate only the resources required to run the driver, and request worker cores only when computation is performed on RDDs. This should allow multiple users to use interactive shells concurrently while still utilizing the entire cluster when performing heavy operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5351: Description: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
was: If the value of 'spark.default.parallelism' do not match the number of partitoins in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx = { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, graph.txt) val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.init(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285820#comment-14285820 ] Apache Spark commented on SPARK-5351: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4136 Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster
[ https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285843#comment-14285843 ] Apache Spark commented on SPARK-5176: - User 'tpanningnextcen' has created a pull request for this issue: https://github.com/apache/spark/pull/4137 Thrift server fails with confusing error message when deploy-mode is cluster Key: SPARK-5176 URL: https://issues.apache.org/jira/browse/SPARK-5176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Tom Panning Labels: starter With Spark 1.2.0, when I try to run {noformat} $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 {noformat} The log output is {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal Jar url 'spark-internal' is not in valid format. Must be a jar file path in URL format (e.g. 
hdfs://host:port/XX.jar, file:///XX.jar)
Usage: DriverClient [options] launch active-master jar-url main-class [driver options]
Usage: DriverClient kill active-master driver-id
Options:
  -c CORES, --cores CORES    Number of cores to request (default: 1)
  -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512)
  -s, --supervise            Whether to restart the driver on failure
  -v, --verbose              Print more debugging output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties {noformat} I do not get this error if deploy-mode is set to client. The --deploy-mode option is described by the --help output, so I expected it to work. I checked, and this behavior seems to be present in Spark 1.1.0 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
Takeshi Yamamuro created SPARK-5351: --- Summary: Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
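The failure above is the generic zip precondition: two RDDs can only be zipped partition-by-partition when their partition counts are identical. A toy model of the check that ZippedPartitionsBaseRDD.getPartitions performs (names simplified; not Spark's actual code):

```python
def zip_partitions(a_parts, b_parts):
    """Pair up partitions of two datasets, refusing mismatched counts,
    mirroring the IllegalArgumentException in the stack trace above."""
    if len(a_parts) != len(b_parts):
        raise ValueError("Can't zip RDDs with unequal numbers of partitions")
    return list(zip(a_parts, b_parts))
```

So if 'spark.default.parallelism' forces one side (e.g. the replicated vertex data) to a different partition count than the edge RDD, the zip inside ReplicatedVertexView.upgrade() fails exactly like this.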
[jira] [Created] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
Kay Ousterhout created SPARK-5360: - Summary: For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example:
$ val a = sc.parallelize(1 to 10).map((_, 1))
$ val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
$ a.cogroup(b).collect()
the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example:
$ val sortedA = a.sortByKey()
$ val sortedB = b.sortByKey()
$ sortedA.cogroup(sortedB).collect()
the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
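The reference-identity problem described above can be reproduced with any serializer: deserializing the same object from two separate byte streams yields two distinct copies, while objects serialized together in one stream keep their shared identity. A small illustration using Python's pickle as a stand-in for Java serialization:

```python
import pickle

class Handle:
    """Stand-in for a ShuffleHandle-like object referenced from two places."""
    pass

h = Handle()

# Serialized together in one stream: pickle's memo preserves identity.
a, b = pickle.loads(pickle.dumps((h, h)))
together_same = a is b        # the two references still point at one object

# Serialized separately (two streams), as the RDD and its partition are:
c = pickle.loads(pickle.dumps(h))
d = pickle.loads(pickle.dumps(h))
separate_same = c is d        # two distinct copies arrive on the "worker"
```

This is exactly why worker-side state stored via one reference is invisible through the other, and why the fix (serializing the shared objects only once, in one stream) also shrinks the closure.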
[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286179#comment-14286179 ] Apache Spark commented on SPARK-5355: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4143 SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker SparkConf is not thread-safe, but it is accessed by many threads. getAll() could return only part of the configs if another thread is modifying the conf concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
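A common fix for this class of bug is to guard reads and writes with one lock and have getAll() return a consistent snapshot rather than a live view. A hedged sketch of that pattern (illustrative only; not the actual patch in the linked PR):

```python
import threading

class Conf:
    """Toy thread-safe config: set() and get_all() synchronize on one
    lock, and get_all() copies the map so callers can iterate it safely
    while other threads keep writing."""
    def __init__(self):
        self._lock = threading.Lock()
        self._settings = {}

    def set(self, key, value):
        with self._lock:
            self._settings[key] = value

    def get_all(self):
        with self._lock:
            return dict(self._settings)  # snapshot, not a live view
```

Because get_all() returns a copy taken under the lock, a caller can never observe a half-updated set of configs, which is the symptom the issue describes.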
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4959: Fix Version/s: 1.2.1 Attributes are case sensitive when using a select query from a projection - Key: SPARK-4959 URL: https://issues.apache.org/jira/browse/SPARK-4959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andy Konwinski Assignee: Cheng Hao Priority: Blocker Labels: backport-needed Fix For: 1.3.0, 1.2.1 Per [~marmbrus], see this line of code, where we should be using an attribute map: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147 To reproduce, I ran the following in the Spark shell:
{code}
import sqlContext._
sql("drop table if exists test")
sql("create table test (col1 string)")
sql("insert into table test select hi from prejoined limit 1")
val projection = 'col1.attr.as(Symbol("CaseSensitiveColName")) :: 'col1.attr.as(Symbol("CaseSensitiveColName2")) :: Nil
sqlContext.table("test").select(projection:_*).registerTempTable("test2")
// This succeeds.
sql("select CaseSensitiveColName from test2").first()
// This fails with java.util.NoSuchElementException: key not found: casesensitivecolname#23046
sql("select casesensitivecolname from test2").first()
{code}
The full stack trace printed for the final command that is failing:
{code}
java.util.NoSuchElementException: key not found: casesensitivecolname#23046
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:57)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
  at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
  at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
  at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
  at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
  at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
  at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
  at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
  at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
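The fix suggested in the issue is to resolve attributes through a map that is keyed case-insensitively instead of by exact-name lookup, so that both spellings of the column resolve to the same attribute. A minimal sketch (illustrative; not Catalyst's actual AttributeMap):

```python
class CaseInsensitiveAttributeMap:
    """Toy attribute map that normalises keys to lower case, so
    'CaseSensitiveColName' and 'casesensitivecolname' both resolve to
    the same underlying attribute."""
    def __init__(self, pairs):
        self._m = {name.lower(): attr for name, attr in pairs}

    def __getitem__(self, name):
        # A plain dict keyed on the original spelling would raise
        # KeyError here, mirroring the NoSuchElementException above.
        return self._m[name.lower()]

attrs = CaseInsensitiveAttributeMap([("CaseSensitiveColName", "col1#1")])
```

With exact-key lookup, the second query in the reproduction fails; with normalised keys, both queries resolve.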
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286250#comment-14286250 ] Yin Huai commented on SPARK-5260: - [~sonixbp] Unfortunately, I failed to come up with a proper name. Will try again :) Expose JsonRDD.allKeysWithValueTypes() in a utility class -- Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5009. - Resolution: Fixed Fix Version/s: (was: 1.2.1) Issue resolved by pull request 3926 [https://github.com/apache/spark/pull/3926] allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug while adding a new feature to SqlParser: if I define a Keyword that has a long name, like
```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```
then, since the all-case-versions expansion is implemented as a recursive function, the implicit asParser conversion can exhaust a small stack and throw a StackOverflowError:
java.lang.StackOverflowError
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
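For context, allCaseVersions enumerates every upper/lower-case variant of a keyword, so a 15-letter keyword like SERDEPROPERTIES expands to 2^15 = 32768 variants; building that set (or the parser alternation over it) recursively is what can blow the stack. An iterative enumeration sidesteps the recursion entirely. A sketch in Python (the real fix lives in Scala's SqlLexical):

```python
from itertools import product

def all_case_versions(word):
    """Yield every case variant of `word` iteratively, with no recursion.
    Assumes alphabetic keywords, where lower() and upper() differ per char."""
    choices = [(c.lower(), c.upper()) for c in word]
    for combo in product(*choices):
        yield "".join(combo)
```

itertools.product walks the 2^n combinations with a constant-depth loop, so the keyword length only affects running time, never stack depth.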
[jira] [Resolved] (SPARK-5064) GraphX rmatGraph hangs
[ https://issues.apache.org/jira/browse/SPARK-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-5064. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 3950 [https://github.com/apache/spark/pull/3950] GraphX rmatGraph hangs -- Key: SPARK-5064 URL: https://issues.apache.org/jira/browse/SPARK-5064 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: CentOS 7 REPL (no HDFS). Also tried Cloudera 5.2.0 QuickStart standalone compiled Scala with spark-submit. Reporter: Michael Malak Fix For: 1.3.0, 1.2.1 org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs 0 edges and then locks up. A spark-user message reports similar behavior: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5275) pyspark.streaming is not included in assembly jar
[ https://issues.apache.org/jira/browse/SPARK-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5275: --- Fix Version/s: 1.2.1 1.3.0 pyspark.streaming is not included in assembly jar - Key: SPARK-5275 URL: https://issues.apache.org/jira/browse/SPARK-5275 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The pyspark.streaming module is not included in Spark's assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4939: --- Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5244) add parser for COALESCE()
[ https://issues.apache.org/jira/browse/SPARK-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5244. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4040 [https://github.com/apache/spark/pull/4040] add parser for COALESCE() - Key: SPARK-5244 URL: https://issues.apache.org/jira/browse/SPARK-5244 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286234#comment-14286234 ] Patrick Wendell commented on SPARK-4939: [~tdas] [~davies] [~kayousterhout] Because there is still discussion about this, and this is modifying a very complex component in Spark, I'm not going to block on this for 1.2.1. Once we merge a patch we can decide whether to put it into 1.2 based on what the final patch looks like. It is definitely inconvenient that this doesn't work in local mode, but much less of a problem than introducing a bug in the scheduler for production cluster workloads. As a workaround we could suggest running this example with local-cluster. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5357) Upgrade from commons-codec 1.5
Matthew Whelan created SPARK-5357: - Summary: Upgrade from commons-codec 1.5 Key: SPARK-5357 URL: https://issues.apache.org/jira/browse/SPARK-5357 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.1.0 Reporter: Matthew Whelan Spark uses commons-codec 1.5, which has a race condition in Base64. That race was introduced in commons-codec 1.4 and resolved in 1.7. The current version of commons-codec is 1.10. Code that runs in Workers and assumes that Base64 is thread-safe will break because Spark is using a non-thread-safe version. See CODEC-96. In addition, the spark.files.userClassPathFirst mechanism is currently broken (bug to come), so there isn't a viable workaround for this issue.
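Until the commons-codec dependency is upgraded past 1.6, one way user code can sidestep the race is to avoid the shared-state codec altogether. The sketch below is illustrative only, not this ticket's fix: it uses the JDK's own java.util.Base64 (Java 8+), whose encoder and decoder hold no mutable state and are therefore safe to share across worker threads.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Base64ThreadSafety {
    // java.util.Base64 encoders/decoders hold no mutable state, so one
    // shared instance is safe across threads -- unlike the shared internal
    // buffers that caused the race in commons-codec 1.4-1.6 (CODEC-96).
    private static final Base64.Encoder ENCODER = Base64.getEncoder();
    private static final Base64.Decoder DECODER = Base64.getDecoder();

    static String roundTrip(String s) {
        String encoded = ENCODER.encodeToString(s.getBytes(StandardCharsets.UTF_8));
        return new String(DECODER.decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String payload = "payload-" + i;
            Callable<Boolean> task = () -> roundTrip(payload).equals(payload);
            results.add(pool.submit(task));
        }
        for (Future<Boolean> f : results) {
            if (!f.get()) throw new AssertionError("corrupted round trip");
        }
        pool.shutdown();
        System.out.println("all round trips OK");
    }
}
```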
[jira] [Updated] (SPARK-5006) spark.port.maxRetries doesn't work
[ https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5006: -- Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0) spark.port.maxRetries doesn't work -- Key: SPARK-5006 URL: https://issues.apache.org/jira/browse/SPARK-5006 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0, 1.2.1 We normally configure spark.port.maxRetries in a properties file or via SparkConf, but Utils.scala reads it from SparkEnv's conf. Since SparkEnv is an object whose env is only set after the JVM launches, and Utils is also an object, in most cases portMaxRetries gets the default value of 16.
[jira] [Updated] (SPARK-5006) spark.port.maxRetries doesn't work
[ https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5006: -- Fix Version/s: 1.2.1 spark.port.maxRetries doesn't work -- Key: SPARK-5006 URL: https://issues.apache.org/jira/browse/SPARK-5006 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0, 1.2.1 We normally configure spark.port.maxRetries in a properties file or via SparkConf, but Utils.scala reads it from SparkEnv's conf. Since SparkEnv is an object whose env is only set after the JVM launches, and Utils is also an object, in most cases portMaxRetries gets the default value of 16.
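The failure mode described above is an initialization-order problem: a value read eagerly when a singleton is initialized captures the pre-configuration default. A minimal Java sketch of the same mechanics (names are hypothetical, not Spark's actual code; a Scala `object` field behaves like the static field here):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a config value read into a static field captures
// whatever is set at class-initialization time, which may be before the
// configuration is populated -- so the default "wins" even when the user
// configured something else.
public class EagerConfigRead {
    static final Map<String, String> CONF = new HashMap<>();

    // BAD: evaluated once, when this class is first loaded (CONF still empty).
    static final int EAGER_MAX_RETRIES =
        Integer.parseInt(CONF.getOrDefault("port.maxRetries", "16"));

    // BETTER: read at call time, after the configuration has been populated.
    static int maxRetries() {
        return Integer.parseInt(CONF.getOrDefault("port.maxRetries", "16"));
    }

    public static void main(String[] args) {
        CONF.put("port.maxRetries", "100"); // configuration arrives "late"
        System.out.println("eager = " + EAGER_MAX_RETRIES); // eager = 16
        System.out.println("lazy = " + maxRetries());       // lazy = 100
    }
}
```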
[jira] [Updated] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4587: - Description: This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization. But it doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods to every model. PMML is an option but it has its limitations. There are couple things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard. was:This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization. But it doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods to every model. PMML is an option but it has its limitations. There are couple things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. Model export/import --- Key: SPARK-4587 URL: https://issues.apache.org/jira/browse/SPARK-4587 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Priority: Critical This is an umbrella JIRA for one of the most requested features on the user mailing list. 
Model export/import can be done via Java serialization, but that doesn't work for models stored in a distributed fashion, e.g., ALS and LDA. Ideally, we should provide save/load methods for every model. PMML is an option, but it has its limitations. There are a couple of things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.
[ https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286387#comment-14286387 ] Tathagata Das commented on SPARK-5142: -- This is definitely a tricky issue. One thing we could try is to stop writing to the current log file, open a new log file, and then try again. The data may have been written to the previous log file, but that does not matter: if the second attempt succeeds, the metadata will reference the segment in the new log file. The data in the old log file can stick around; it does not matter much. Possibly data may be ruined in Spark Streaming's WAL mechanism. --- Key: SPARK-5142 URL: https://issues.apache.org/jira/browse/SPARK-5142 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Currently in Spark Streaming's WAL manager, data is written into HDFS with multiple tries on failure. Because there is no transactional guarantee, previously partially-written data is not rolled back and the retried data is appended after it; this corrupts the file and makes WriteAheadLogReader fail when reading the data. Firstly, I think this problem is hard to fix because HDFS does not support a truncate operation (HDFS-3107) or random writes at a specific offset. Secondly, I think if we meet such a write exception, it is better not to try again; retrying corrupts the file and makes reads abnormal. Sorry if I misunderstand anything.
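The roll-to-a-new-file retry strategy suggested above can be sketched as follows. This is a hedged illustration with hypothetical names, not Spark's actual WAL manager: on a failed append the writer abandons the possibly-corrupt file and retries in a fresh one, so the segment metadata returned to the caller only ever references a clean write.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the retry strategy discussed above: on a failed
// append, abandon the current log file and retry the record in a fresh
// file, so the returned "segment" only ever points at a clean write.
public class RollOnFailureLog {
    private final Path dir;
    private Path current;
    private int fileIndex = 0;

    RollOnFailureLog(Path dir) throws IOException {
        this.dir = dir;
        roll();
    }

    private void roll() throws IOException {
        current = dir.resolve("wal-" + (fileIndex++) + ".log");
        Files.createFile(current);
    }

    /** Returns the file the record actually landed in. */
    Path write(byte[] record, boolean failFirstAttempt) throws IOException {
        try {
            if (failFirstAttempt) throw new IOException("simulated partial write");
            Files.write(current, record, StandardOpenOption.APPEND);
            return current;
        } catch (IOException e) {
            roll();                 // abandon the possibly-corrupt file
            Files.write(current, record, StandardOpenOption.APPEND);
            return current;         // metadata references the new file
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wal");
        RollOnFailureLog log = new RollOnFailureLog(dir);
        Path ok = log.write("a".getBytes(), false);
        Path retried = log.write("b".getBytes(), true);
        System.out.println(ok.equals(retried) ? "same file" : "rolled to new file");
    }
}
```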
[jira] [Commented] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286386#comment-14286386 ] Apache Spark commented on SPARK-5360: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4145 For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. 
For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.
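The underlying mechanism (the partition and the RDD are serialized in separate streams, so their shared references come back as distinct objects on the worker) can be demonstrated with plain Java serialization. The class names below are illustrative, not Spark's:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates the issue's mechanism: an object referenced from two
// structures that are serialized *separately* (like the RDD and its
// partition, which travel in different streams) deserializes into two
// distinct copies, not one shared instance. Within a single stream,
// Java serialization would preserve the back-reference instead.
public class SeparateStreamIdentity {
    static class Handle implements Serializable { int id = 42; }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Handle shared = new Handle();
        // Two separate streams: the handle is serialized (and allocated) twice.
        Handle a = (Handle) deserialize(serialize(shared));
        Handle b = (Handle) deserialize(serialize(shared));
        System.out.println("same instance after deserialization? " + (a == b)); // false
    }
}
```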
[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286227#comment-14286227 ] Apache Spark commented on SPARK-3424: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4144 KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and incrementally updating that value for each point. The incremental update is O(1) per point, which reduces the running time of KMeansPlusPlus to O(m k). For large k, this is significant.
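A minimal sketch of the proposed improvement, using 1-D points and hypothetical names for brevity: k-means++ seeding that keeps each point's squared distance to its nearest chosen center and updates it in O(1) per point when a new center is added, instead of rescanning all centers each round.

```java
import java.util.Random;

// Sketch of k-means++ seeding with the incremental trick described above:
// minDistSq[i] holds point i's squared distance to the nearest chosen
// center, and after each new center is picked we only compare against
// that one new center -- O(1) per point per round.
public class KMeansPlusPlusSeed {
    static double[] chooseCenters(double[] points, int k, Random rnd) {
        int n = points.length;
        double[] centers = new double[k];
        centers[0] = points[rnd.nextInt(n)];
        double[] minDistSq = new double[n];
        for (int i = 0; i < n; i++) {
            double d = points[i] - centers[0];
            minDistSq[i] = d * d;
        }
        for (int c = 1; c < k; c++) {
            // D^2-weighted sampling of the next center.
            double total = 0;
            for (double d2 : minDistSq) total += d2;
            double r = rnd.nextDouble() * total, acc = 0;
            int chosen = n - 1;
            for (int i = 0; i < n; i++) {
                acc += minDistSq[i];
                if (acc >= r) { chosen = i; break; }
            }
            centers[c] = points[chosen];
            // Incremental update: compare each point only against the
            // newly added center, not all c centers chosen so far.
            for (int i = 0; i < n; i++) {
                double d = points[i] - centers[c];
                minDistSq[i] = Math.min(minDistSq[i], d * d);
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] pts = {0, 0.1, 0.2, 10, 10.1, 10.2, 20, 20.1};
        double[] centers = chooseCenters(pts, 3, new Random(7));
        System.out.println("chose " + centers.length + " centers");
    }
}
```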
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286236#comment-14286236 ] Kay Ousterhout commented on SPARK-4939: --- [~pwendell] just want to make sure you understand that there's a simple but slightly hacky change here that only modifies the local scheduler. Just wanted to point that out in case that changes the dynamics of whether this should be fixed for 1.2. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5346: Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0, 1.2.1) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Blocker When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be useful to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics in the web UI. When filter pushdown is enabled, the second query reads less data. 
Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkContext.hadoopConfiguration.set()}}); setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
[jira] [Updated] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-5360: -- Description: CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). 
However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures. was: CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. 
For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran
[jira] [Closed] (SPARK-5359) ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-5359. Resolution: Duplicate ML model import/export -- Key: SPARK-5359 URL: https://issues.apache.org/jira/browse/SPARK-5359 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley ML model import/export is a key component of any ML library. This JIRA is for creating a long-term plan to support model import/export. _From the design doc linked below:_ This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard. [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3702: - Description: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] was: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. 
Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286407#comment-14286407 ] Ted Yu commented on SPARK-1714: --- {code} if (completedContainer.getExitStatus == -103) { // vmem limit exceeded {code} Should ContainerExitStatus#KILLED_EXCEEDED_VMEM be referenced above? Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286416#comment-14286416 ] Ted Yu commented on SPARK-1714: --- allocatedHostToContainersMap.synchronized is absent for the following operation in runAllocatedContainers(): {code} val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, new HashSet[ContainerId]) containerSet += containerId allocatedContainerToHostMap.put(containerId, executorHostname) {code} Is that intentional? Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0
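The concern above is about a compound update to two related maps. A minimal, hypothetical Java sketch of why such updates normally go under a single lock (illustrative names, not the YARN allocator's actual code): if only one of the two puts is guarded, a concurrent reader can observe a container in one map but not the other.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: both maps must mutate under one lock so readers never observe
// a container recorded in containerToHost but missing from
// hostToContainers (or vice versa). Names are illustrative only.
public class GuardedTwoMapUpdate {
    private final Map<String, Set<String>> hostToContainers = new HashMap<>();
    private final Map<String, String> containerToHost = new HashMap<>();
    private final Object lock = new Object();

    void recordAllocation(String host, String containerId) {
        synchronized (lock) { // the two updates are atomic w.r.t. readers
            hostToContainers.computeIfAbsent(host, h -> new HashSet<>()).add(containerId);
            containerToHost.put(containerId, host);
        }
    }

    boolean consistent(String containerId) {
        synchronized (lock) {
            String host = containerToHost.get(containerId);
            return host != null
                && hostToContainers.getOrDefault(host, Collections.emptySet()).contains(containerId);
        }
    }

    public static void main(String[] args) {
        GuardedTwoMapUpdate m = new GuardedTwoMapUpdate();
        m.recordAllocation("host-1", "container_0001");
        System.out.println("consistent = " + m.consistent("container_0001"));
    }
}
```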
[jira] [Resolved] (SPARK-3958) Possible stream-corruption issues in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3958. Resolution: Fixed Target Version/s: (was: 1.2.1) At this point I'm not aware of people still hitting this set of issues in newer releases, so per discussion with [~joshrosen], I'd like to close this. Please comment on this JIRA if you are having some variant of this issue in a newer version of Spark, and we'll continue to investigate. Possible stream-corruption issues in TorrentBroadcast - Key: SPARK-3958 URL: https://issues.apache.org/jira/browse/SPARK-3958 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: spark_ex.logs TorrentBroadcast deserialization sometimes fails with decompression errors, which are most likely caused by stream-corruption exceptions. For example, this can manifest itself as a Snappy PARSING_ERROR when deserializing a broadcasted task: {code} 14/10/14 17:20:55.016 DEBUG BlockManager: Getting local block broadcast_8 14/10/14 17:20:55.016 DEBUG BlockManager: Block broadcast_8 not registered locally 14/10/14 17:20:55.016 INFO TorrentBroadcast: Started reading broadcast variable 8 14/10/14 17:20:55.017 INFO TorrentBroadcast: Reading broadcast variable 8 took 5.3433E-5 s 14/10/14 17:20:55.017 ERROR Executor: Exception in task 2.0 in stage 8.0 (TID 18) java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216) at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:170) at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} SPARK-3630 is an umbrella ticket for investigating all causes of these Kryo and Snappy deserialization errors. This ticket is for a more narrowly-focused exploration of the TorrentBroadcast version of these errors, since the similar errors that we've seen in sort-based shuffle seem to be explained by a different cause (see SPARK-3948). 
[jira] [Resolved] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4105. Resolution: Fixed Target Version/s: (was: 1.2.1) At this point I'm not aware of people still hitting this set of issues in newer releases, so per discussion with [~joshrosen], I'd like to close this. Please comment on this JIRA if you are having some variant of this issue in a newer version of Spark, and we'll continue to investigate. FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
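Snappy is not in the Python standard library, so the sketch below uses `zlib` as a stdlib-only analog to illustrate the general failure mode discussed above: when fetched shuffle bytes are truncated or corrupted in transit, the error surfaces inside the decompression codec on the reading side, far from the actual fault. The payload bytes here are invented for illustration.

```python
import zlib

# Stand-in for a compressed shuffle block (illustrative data only).
block = zlib.compress(b"shuffle block bytes" * 100)

# Simulate a fetch that drops the tail of the stream.
truncated = block[:-4]

error_message = None
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    # Analogous to Snappy's FAILED_TO_UNCOMPRESS(5): the reader, not the
    # writer or the network layer, is where the corruption is reported.
    error_message = str(exc)
```

This is why the stack traces above point at the codec's input stream rather than at whatever originally corrupted the block.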
[jira] [Updated] (SPARK-5361) add in tuple handling for converting python RDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Winston Chen updated SPARK-5361: Description: Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 was: Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. 
So with the following data: ``` [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] ``` Exceptions happen with the `genres` part: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` This pull request adds tuple handling in both `SerDeUtil.pythonToJava` and `JavaToWritableConverter.convertToWritable`. add in tuple handling for converting python RDD back to JavaRDD --- Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. 
So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
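Until the fix in pull request 4146, one client-side workaround was to normalize tuples to lists before handing a PySpark RDD back to the JVM, since Pyrolite pickles a Python `list` to a `java.util.ArrayList` but a `tuple` to `Object[]`. The helper below is an illustrative sketch, not part of any Spark API:

```python
def tuples_to_lists(obj):
    """Recursively replace tuples with lists so that Pyrolite produces
    java.util.ArrayList on the JVM side instead of Object[]."""
    if isinstance(obj, (tuple, list)):
        return [tuples_to_lists(x) for x in obj]
    if isinstance(obj, dict):
        return {k: tuples_to_lists(v) for k, v in obj.items()}
    return obj

# One record shaped like the data from the issue report.
record = (u'2', {u'director': u'David Lean',
                 u'genres': (u'Adventure', u'Biography', u'Drama'),
                 u'title': u'Lawrence of Arabia',
                 u'year': 1962})

clean = tuples_to_lists(record)
# In real code this would run before the Java conversion,
# e.g. rdd.map(tuples_to_lists) prior to crossing into the JVM.
```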
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286465#comment-14286465 ] Victor Tso commented on SPARK-4105: --- What's the fix version? FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Here's another occurrence of a similar error: {code} java.io.IOException: failed to read chunk org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
[jira] [Closed] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-944. --- Resolution: Not a Problem Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5361) python tuple not supported while converting PythonRDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Winston Chen updated SPARK-5361: Summary: python tuple not supported while converting PythonRDD back to JavaRDD (was: add in tuple handling for converting python RDD back to JavaRDD) python tuple not supported while converting PythonRDD back to JavaRDD - Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286578#comment-14286578 ] Apache Spark commented on SPARK-4520: - User 'sadhan' has created a pull request for this issue: https://github.com/apache/spark/pull/4148 SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at least one particular column type, map<string, set<string>>, causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional map<SomeExclusionCause, set<SomeId>> col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = "/path/to/generated/parquet/file" val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable("test") sqlContext.cacheTable("test") sqlContext.sql("select col_a from test").collect() -- see the exception stack here {code} {code} 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at 
org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at
[jira] [Commented] (SPARK-5347) InputMetrics bug when inputSplit is not instanceOf FileSplit
[ https://issues.apache.org/jira/browse/SPARK-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286688#comment-14286688 ] Apache Spark commented on SPARK-5347: - User 'shenh062326' has created a pull request for this issue: https://github.com/apache/spark/pull/4150 InputMetrics bug when inputSplit is not instanceOf FileSplit Key: SPARK-5347 URL: https://issues.apache.org/jira/browse/SPARK-5347 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen When inputFormatClass is set to CombineFileInputFormat, input metrics show that input is empty. This did not occur in spark-1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
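The reported behavior reduces to a small, purely illustrative model (the class and function names below are invented stand-ins, not Spark code): if bytes-read is computed only for a plain `FileSplit`, every other split type, including the splits produced by `CombineFileInputFormat`, silently contributes zero to the input metrics.

```python
class FileSplit:
    """Toy stand-in for Hadoop's FileSplit."""
    def __init__(self, length):
        self.length = length

class CombineFileSplit:
    """Toy stand-in for the multi-file splits CombineFileInputFormat produces."""
    def __init__(self, lengths):
        self.lengths = lengths

def bytes_read_buggy(split):
    # Mirrors the bug: anything that is not a plain FileSplit
    # contributes nothing to the input metrics.
    return split.length if isinstance(split, FileSplit) else 0

def bytes_read_fixed(split):
    # One possible fix: handle each known split type explicitly.
    if isinstance(split, FileSplit):
        return split.length
    if isinstance(split, CombineFileSplit):
        return sum(split.lengths)
    return 0

combined = CombineFileSplit([128, 256, 512])
```

Under this toy model the buggy path reports zero bytes for the combined split while the fixed path sums its constituent file lengths.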
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Description: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. 
Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. Users may also see confusing NPEs when calling methods on stopped SparkContexts, so I've added checks for that as well. was: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. 
Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at
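The Python-side improvement described above can be sketched in plain Python with no Spark required: because pickle protocol 2 and newer consults `__getnewargs__` while serializing, overriding it makes any attempt to pickle the object fail fast with a readable message instead of the opaque Py4J error. The class below is a hypothetical stand-in for PySpark's `RDD`, and the message text is illustrative:

```python
import pickle

class FakeRDD:
    """Hypothetical stand-in for pyspark.RDD."""
    def __getnewargs__(self):
        # Pickle calls this during serialization (protocol >= 2), so
        # referencing an RDD inside a transformation or broadcasting one
        # now fails immediately with guidance instead of a Py4JError.
        raise RuntimeError(
            "It appears that you are attempting to reference an RDD "
            "from inside a transformation or to broadcast an RDD, "
            "which is not supported.")

message = None
try:
    pickle.dumps(FakeRDD())
except Exception as exc:
    message = str(exc)
```

The same trick gives a clear error for the broadcast case mentioned above, since broadcasting also pickles its argument.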
[jira] [Commented] (SPARK-4506) Update documentation to clarify whether standalone-cluster mode is now officially supported
[ https://issues.apache.org/jira/browse/SPARK-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286466#comment-14286466 ] Asim Jalis commented on SPARK-4506: --- This bug should be reopened. The doc needs some more changes. Doc: https://github.com/apache/spark/blob/master/docs/submitting-applications.md Current text: Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications. Proposed text: Note that cluster mode is currently not supported for Mesos clusters, or Python applications. Update documentation to clarify whether standalone-cluster mode is now officially supported --- Key: SPARK-4506 URL: https://issues.apache.org/jira/browse/SPARK-4506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Fix For: 1.1.1, 1.2.0 The Launching Compiled Spark Applications section of the Spark Standalone docs claims that standalone mode only supports {{client}} deploy mode: {quote} The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster. For standalone clusters, Spark currently only supports deploying the driver inside the client process that is submitting the application (client deploy mode). {quote} It looks like {{standalone-cluster}} mode actually works (I've used it and have heard from users that are successfully using it, too). The current line was added in SPARK-2259 when {{standalone-cluster}} mode wasn't officially supported. It looks like SPARK-2260 fixed a number of bugs in {{standalone-cluster}} mode, so we should update the documentation if we're now ready to officially support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4984. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3819 [https://github.com/apache/spark/pull/3819] add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4984: -- Assignee: wangfei add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4984: -- Component/s: (was: Spark Core) Web UI add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5227: -- Priority: Blocker (was: Major) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message {code} ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 
32, 32, 32, [...] {code}
[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286681#comment-14286681 ] Josh Rosen commented on SPARK-5227: --- I've bumped this up to a 1.2.1 blocker to see if we can find a fix, since this is preventing the unit tests from running in SBT under certain Hadoop configurations. InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message: {code} ArrayBuffer(32, 32, 32, ...) [a long run of identical 32-byte input-metric values, elided] {code}
[jira] [Created] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
Alexander Ulanov created SPARK-5362: --- Summary: Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5361) add in tuple handling for converting python RDD back to JavaRDD
Winston Chen created SPARK-5361: --- Summary: add in tuple handling for converting python RDD back to JavaRDD Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen The existing `SerDeUtil.pythonToJava` implementation does not account for the tuple case: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: ``` [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] ``` Exceptions happen with the `genres` part: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` The pull request adds tuple handling in both `SerDeUtil.pythonToJava` and `JavaToWritableConverter.convertToWritable`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
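Until a fix like the one above lands, a possible user-side workaround is to convert tuples to lists before the RDD crosses back into the JVM, since Pyrolite maps Python lists to `java.util.ArrayList` (which `SerDeUtil.pythonToJava` handles) but tuples to `Object[]`. A minimal sketch, with a hypothetical helper name:

```python
# Hypothetical workaround sketch: recursively replace Python tuples with
# lists so that Pyrolite serializes them as java.util.ArrayList instead
# of Object[]. In a real job this would run as rdd.map(tuples_to_lists)
# before handing the RDD back to Java.
def tuples_to_lists(obj):
    if isinstance(obj, (tuple, list)):
        return [tuples_to_lists(x) for x in obj]
    if isinstance(obj, dict):
        return {k: tuples_to_lists(v) for k, v in obj.items()}
    return obj

record = (u'2', {u'director': u'David Lean',
                 u'genres': (u'Adventure', u'Biography', u'Drama'),
                 u'title': u'Lawrence of Arabia',
                 u'year': 1962})
safe = tuples_to_lists(record)
```

This only works around the serialization path; the actual fix in the PR handles tuples inside `SerDeUtil` itself.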
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286543#comment-14286543 ] Hari Shreedharan commented on SPARK-5342: - Looks like SPARK-3883 is adding SSL support even to Akka, in which case we could simply use Akka for sending the new delegation tokens (and avoid the HTTP server route). Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, Spark Streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286596#comment-14286596 ] Apache Spark commented on SPARK-5147: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4149 write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called -- Key: SPARK-5147 URL: https://issues.apache.org/jira/browse/SPARK-5147 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu Priority: Blocker Hi all, We are running a Spark streaming application with ReliableKafkaReceiver. We have spark.streaming.receiver.writeAheadLog.enable set to true, so write ahead logs (WALs) for received data are created under the receivedData/streamId folder in the checkpoint directory. However, old WALs are never purged over time. receivedBlockMetadata and checkpoint files are purged correctly though. I went through the code: the WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is responsible for cleaning up the old blocks. It has a method, cleanupOldBlocks, which is never called by any class. The ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock method to create WALs and never calls the cleanupOldBlocks method to purge old WALs. The size of the WAL folder increases constantly on HDFS. This is preventing us from running the ReliableKafkaReceiver 24x7. Can somebody please take a look? Thanks, Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
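The cleanup that cleanupOldBlocks is meant to perform amounts to a time-threshold purge of old log segments. The sketch below illustrates that policy in plain Python; the file names and data layout are illustrative, not Spark's actual WAL format:

```python
# Illustrative sketch (not Spark code) of a time-threshold WAL purge:
# any log segment whose end timestamp falls before the threshold is
# eligible for deletion. Segment names here are hypothetical.
def purge_old_logs(logs, threshold_time):
    """Split log segments into (kept, purged) by their end timestamp."""
    kept = [log for log in logs if log["end_time"] >= threshold_time]
    purged = [log for log in logs if log["end_time"] < threshold_time]
    return kept, purged

logs = [{"path": "log-0-100", "end_time": 100},
        {"path": "log-100-200", "end_time": 200},
        {"path": "log-200-300", "end_time": 300}]
kept, purged = purge_old_logs(logs, threshold_time=150)
```

The actual fix is simply to wire a call like this into the receiver supervisor so it runs periodically, rather than never.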
[jira] [Resolved] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5355. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4143 [https://github.com/apache/spark/pull/4143] SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The SparkConf is not thread-safe, but it is accessed by many threads. getAll() can return a partial view of the configs if another thread is modifying it concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5355: -- Assignee: Davies Liu SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The SparkConf is not thread-safe, but it is accessed by many threads. getAll() can return a partial view of the configs if another thread is modifying it concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
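The usual remedy for this class of bug is to snapshot the settings under a lock, so that a getAll()-style reader never iterates a map another thread is mutating. A minimal sketch of that copy-under-lock pattern (not Spark's actual implementation):

```python
import threading

# Minimal sketch of the copy-under-lock pattern: reads return an
# immutable snapshot taken while holding the same lock writers use,
# so a reader can never observe a half-updated map.
class SafeConf:
    def __init__(self):
        self._lock = threading.Lock()
        self._settings = {}

    def set(self, key, value):
        with self._lock:
            self._settings[key] = value

    def get_all(self):
        # Snapshot atomically; callers get a stable, sorted copy.
        with self._lock:
            return tuple(sorted(self._settings.items()))

conf = SafeConf()
conf.set("spark.app.name", "demo")
conf.set("spark.master", "local[2]")
```

An alternative with the same effect is a copy-on-write map, which trades write cost for lock-free reads.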
[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286701#comment-14286701 ] Apache Spark commented on SPARK-5362: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/4152 Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
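The shape of the proposed generalization can be illustrated with a vector-valued loss gradient. The function below is a hypothetical sketch, not the PR's actual code: once the target is a Vector instead of a Double label, the same signature covers regression targets, one-hot class indicators, and whole neural-network output layers.

```python
# Hypothetical sketch of a gradient that takes a Vector output instead
# of a Double label: the squared-error gradient of a prediction.
def least_squares_gradient(prediction, output):
    """Gradient of 0.5 * ||prediction - output||^2 w.r.t. prediction.

    Both arguments are vectors (plain lists here), so a one-hot class
    indicator or an entire output layer fits the same API, which a
    single Double label cannot express.
    """
    return [p - o for p, o in zip(prediction, output)]

# A 3-class one-hot target, impossible to encode as one Double label:
grad = least_squares_gradient([0.2, 0.5, 0.3], [0.0, 1.0, 0.0])
```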
[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286703#comment-14286703 ] Alexander Ulanov commented on SPARK-5362: - https://github.com/apache/spark/pull/4152 Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5361) python tuple not supported while converting PythonRDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286459#comment-14286459 ] Apache Spark commented on SPARK-5361: - User 'wingchen' has created a pull request for this issue: https://github.com/apache/spark/pull/4146 python tuple not supported while converting PythonRDD back to JavaRDD - Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen The existing `SerDeUtil.pythonToJava` implementation does not account for the tuple case: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4631. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical Fix For: 1.3.0, 1.2.1 A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Target Version/s: 1.2.1 Display more helpful error messages for several invalid operations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] 
File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286447#comment-14286447 ] Davies Liu commented on SPARK-4939: --- Sent out a PR to change the local scheduler to revive offers periodically, it should be safer to be merged into 1.2. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286445#comment-14286445 ] Apache Spark commented on SPARK-4939: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4147 Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286639#comment-14286639 ] Tathagata Das commented on SPARK-944: - I am closing this JIRA because it is no longer relevant. For examples, any reader can take a look at https://github.com/cloudera-labs/SparkOnHBase/blob/cdh5-0.0.1/src/main/java/com/cloudera/spark/hbase/example/JavaHBaseStreamingBulkPutExample.java Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4586) Python API for ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286693#comment-14286693 ] Apache Spark commented on SPARK-4586: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4151 Python API for ML Pipeline -- Key: SPARK-4586 URL: https://issues.apache.org/jira/browse/SPARK-4586 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Add Python API to the newly added ML pipeline and parameters. The initial design doc is posted here: https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Summary: Display more helpful error messages for several invalid operations (was: Raise more helpful errors when RDD actions or transformations are called inside of transformations) Display more helpful error messages for several invalid operations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] 
File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
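The suggested guard (turning {{sc}} into a getter that fails loudly instead of producing an NPE) can be sketched as follows; the class and message here are illustrative, not Spark's actual wording:

```python
# Sketch of the guard described above: make the context field a getter
# that raises a descriptive error when an RDD is used inside another
# RDD's transformation (where its SparkContext reference is gone).
class FakeRDD:
    def __init__(self, sc):
        self._sc = sc  # becomes None when serialized into a task

    @property
    def context(self):
        if self._sc is None:
            raise RuntimeError(
                "RDD transformations and actions can only be invoked "
                "by the driver, not inside of other transformations.")
        return self._sc

rdd = FakeRDD(sc=None)  # simulates an RDD deserialized inside a task
try:
    rdd.context
except RuntimeError as e:
    message = str(e)
```

The point of the pattern is that the failure names the actual mistake (nesting RDD operations) rather than surfacing as a bare NullPointerException.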
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286706#comment-14286706 ] Alexander Ulanov commented on SPARK-5256: - I've implemented my proposition with Vector as output in https://issues.apache.org/jira/browse/SPARK-5362 Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286971#comment-14286971 ] Tsuyoshi OZAWA commented on SPARK-2546: --- Now HADOOP-11209, the problem reported by [~joshrosen], is resolved by [~varun_saxena]'s contribution. Thanks for your reporting. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.2, 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical Fix For: 1.1.1, 1.2.0, 1.0.3 // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! 
It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't thread-safe. The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:436) at
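The clone-per-task fix proposed above can be illustrated with a small, self-contained sketch. Note this uses a plain java.util.HashMap standing in for Hadoop's Configuration, and getJobConf and the task body are hypothetical stand-ins, not Spark's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerTaskConfig {
    // Shared, broadcast-style configuration: the read-only master copy.
    private static final Map<String, String> sharedConf = new HashMap<>();

    // Analogue of the proposed HadoopRDD.getJobConf() fix: hand each task its
    // own clone, so concurrent tasks never mutate the same HashMap.
    static Map<String, String> getJobConf() {
        synchronized (sharedConf) {           // guard the clone itself
            return new HashMap<>(sharedConf);
        }
    }

    public static void main(String[] args) throws Exception {
        sharedConf.put("fs.defaultFS", "hdfs://namenode:8020");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            final int task = i;
            pool.submit(() -> {
                Map<String, String> conf = getJobConf();
                // Each task may freely add keys (as AvroRecordReader does)
                // without racing against the other tasks.
                conf.put("rpc.engine.task", Integer.toString(task));
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(sharedConf.size()); // master copy untouched: 1
    }
}
```

The ~10KB-per-clone cost mentioned in the thread is the price of this isolation; the broadcast of the master copy across the cluster is unaffected.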
[jira] [Resolved] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3424. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4144 [https://github.com/apache/spark/pull/4144] KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns Fix For: 1.3.0 The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and incrementally updating that value for each point. The incremental update takes O(1) time per point, which reduces the running time of KMeansPlusPlus to O(m k). For large k, this is significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
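The incremental trick described above can be sketched as follows. This is a minimal standalone illustration, not Spark's actual MLlib code: since only the newest center can lower a point's nearest-center distance, each selection round costs O(1) per point instead of O(k).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // minDist[p] caches the squared distance from point p to its closest
    // chosen center. Each round only compares against the NEWEST center,
    // so the update is O(1) per point rather than an O(k) rescan.
    static List<double[]> chooseCenters(double[][] points, int k, Random rnd) {
        double[] minDist = new double[points.length];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]);
        while (centers.size() < k) {
            double[] latest = centers.get(centers.size() - 1);
            double total = 0;
            for (int p = 0; p < points.length; p++) {
                // incremental update against the newest center only
                minDist[p] = Math.min(minDist[p], dist2(points[p], latest));
                total += minDist[p];
            }
            // standard k-means++ sampling, proportional to minDist
            double r = rnd.nextDouble() * total, acc = 0;
            int chosen = points.length - 1;
            for (int p = 0; p < points.length; p++) {
                acc += minDist[p];
                if (acc >= r) { chosen = p; break; }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }
}
```

Over m rounds this is O(m k) distance updates per point in total, matching the bound claimed in the ticket.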
[jira] [Commented] (SPARK-5297) File Streams do not work with custom key/values
[ https://issues.apache.org/jira/browse/SPARK-5297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286750#comment-14286750 ] Apache Spark commented on SPARK-5297: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4154 File Streams do not work with custom key/values --- Key: SPARK-5297 URL: https://issues.apache.org/jira/browse/SPARK-5297 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Leonidas Fegaras Assignee: Saisai Shao Labels: backport-needed Fix For: 1.3.0 The following code: {code} stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory) .foreachRDD(new Function<JavaPairRDD<K,V>,Void>() { public Void call ( JavaPairRDD<K,V> rdd ) throws Exception { for ( Tuple2<K,V> x: rdd.collect() ) System.out.println("# "+x._1+" "+x._2); return null; } }); stream_context.start(); stream_context.awaitTermination(); {code} for custom (serializable) classes K and V compiles fine but gives an error when I drop a new hadoop sequence file in the directory: {quote} 15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 1421507639000 ms java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288) at scala.Option.orElse(Option.scala:257) {quote} The same classes K and V work fine for non-streaming Spark: {code} spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf) {code} also streaming works fine for TextFileInputFormat. The issue is that class manifests are erased to object in the Java file stream constructor, but those are relied on downstream when creating the Hadoop RDD that backs each batch of the file stream. https://github.com/apache/spark/blob/v1.2.0/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala#L263 https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L753 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
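The erasure problem behind this bug can be reproduced in isolation. In the sketch below (hypothetical names; array creation stands in for the InputFormat instantiation), a generic method cannot recover its type parameter at runtime, so anything built from it collapses to Object, while threading an explicit Class token through the API, as the non-streaming newAPIHadoopFile overloads do, preserves the real type:

```java
import java.lang.reflect.Array;

public class ErasureDemo {
    // Erased: the JVM cannot recover K here, so the array is really an
    // Object[] whatever K is -- analogous to the fileStream manifests
    // collapsing to Object and causing the ClassCastException above.
    @SuppressWarnings("unchecked")
    static <K> K[] erasedArray(int n) {
        return (K[]) new Object[n];
    }

    // Fix pattern: pass an explicit Class<K> token so the runtime type
    // survives erasure.
    @SuppressWarnings("unchecked")
    static <K> K[] tokenArray(Class<K> cls, int n) {
        return (K[]) Array.newInstance(cls, n);   // a genuine K[]
    }

    public static void main(String[] args) {
        Object a = erasedArray(3);
        Object b = tokenArray(String.class, 3);
        System.out.println(a instanceof String[]); // false
        System.out.println(b instanceof String[]); // true
    }
}
```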
[jira] [Commented] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287004#comment-14287004 ] Yash Datta commented on SPARK-4786: --- https://github.com/apache/spark/pull/4156 Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
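Why can INT pushdown cover BYTE and SHORT? Because widening to int is lossless and order-preserving, a predicate on the widened values accepts exactly the same rows. A quick self-contained check (illustrative only, not Spark or Parquet code) confirms this exhaustively for bytes:

```java
public class WideningPredicate {
    static boolean byteLess(byte v, byte bound) { return v < bound; }

    // The widened form that could be handed to an INT32 Parquet filter.
    static boolean intLess(int v, int bound) { return v < bound; }

    public static void main(String[] args) {
        // Exhaustively confirm that a BYTE predicate and its widened INT
        // form agree on every possible byte value.
        byte bound = 42;
        boolean agree = true;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte v = (byte) i;
            agree &= byteLess(v, bound) == intLess(v, bound);
        }
        System.out.println(agree); // true
    }
}
```

The same argument applies to SHORT, since Parquet stores both logical types in an INT32 physical column.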
[jira] [Commented] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287003#comment-14287003 ] Apache Spark commented on SPARK-4786: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/4156 Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-4786: -- Comment: was deleted (was: https://github.com/apache/spark/pull/4156) Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287018#comment-14287018 ] Hari Shreedharan commented on SPARK-5342: - I am considering just copying the keytab and principal to the application's staging directory, instead of using the distributed cache. Any suggestions on which is better? Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, Spark Streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5363) Spark 1.2 freeze without error notification
Tassilo Klein created SPARK-5363: Summary: Spark 1.2 freeze without error notification Key: SPARK-5363 URL: https://issues.apache.org/jira/browse/SPARK-5363 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Tassilo Klein Priority: Critical Fix For: 1.2.1, 1.2.2 After a number of calls to a map().collect() statement, Spark freezes without reporting any error. Within the map a large broadcast variable is used. The freeze can be avoided by setting 'spark.python.worker.reuse = false' (Spark 1.2) or by using an earlier version, but at the price of lower speed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5347) InputMetrics bug when inputSplit is not instanceOf FileSplit
[ https://issues.apache.org/jira/browse/SPARK-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286738#comment-14286738 ] Hong Shen commented on SPARK-5347: -- In addition, we sometimes use other inputFormats whose inputSplit is not an instance of FileSplit, for example CombineFileSplit: {code} public class CombineFileSplit implements InputSplit {code} In this case, we can simply read bytesRead when the reader is closed. InputMetrics bug when inputSplit is not instanceOf FileSplit Key: SPARK-5347 URL: https://issues.apache.org/jira/browse/SPARK-5347 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen When inputFormatClass is set to CombineFileInputFormat, input metrics show that input is empty. This did not occur in Spark 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
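The guard-and-fallback pattern the comment suggests might be sketched as follows. The types here are hypothetical stand-ins for the Hadoop classes, not Spark's actual implementation: report the split length up front only when the split is a FileSplit, and otherwise fall back to the byte counter observed when the record reader is closed.

```java
public class InputMetricsSketch {
    // Stand-ins for the Hadoop split types discussed above (for
    // illustration only).
    interface InputSplit {}
    static class FileSplit implements InputSplit {
        final long length;
        FileSplit(long length) { this.length = length; }
    }
    static class CombineFileSplit implements InputSplit {}

    // Only FileSplit carries a usable length up front; for anything else
    // (e.g. CombineFileSplit) use the bytes-read counter taken at close.
    static long bytesRead(InputSplit split, long bytesReadAtClose) {
        if (split instanceof FileSplit) {
            return ((FileSplit) split).length;
        }
        return bytesReadAtClose;
    }

    public static void main(String[] args) {
        System.out.println(bytesRead(new FileSplit(100), 0));       // 100
        System.out.println(bytesRead(new CombineFileSplit(), 7));   // 7
    }
}
```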
[jira] [Issue Comment Deleted] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Tso updated SPARK-4105: -- Comment: was deleted (was: What's the fix version?) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Here's another occurrence of a similar error: {code} java.io.IOException: failed to read chunk org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
[jira] [Updated] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5351: Description: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
was: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-5357) Upgrade from commons-codec 1.5
[ https://issues.apache.org/jira/browse/SPARK-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286728#comment-14286728 ] Apache Spark commented on SPARK-5357: - User 'MattWhelan' has created a pull request for this issue: https://github.com/apache/spark/pull/4153 Upgrade from commons-codec 1.5 -- Key: SPARK-5357 URL: https://issues.apache.org/jira/browse/SPARK-5357 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Whelan Original Estimate: 24h Remaining Estimate: 24h Spark uses commons-codec 1.5, which has a race condition in Base64. That race was introduced in commons-codec 1.4 and resolved in 1.7. The current version of commons-codec is 1.10. Code that runs in Workers and assumes that Base64 is thread-safe will break because Spark is using a non-thread-safe version. See CODEC-96. In addition, the spark.files.userClassPathFirst mechanism is currently broken (bug to come), so there isn't a viable workaround for this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
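For application code that cannot wait for the dependency upgrade, one possible workaround on JDK 8+ is java.util.Base64, whose encoders and decoders are immutable and safe to share across threads, unlike the racy commons-codec 1.4-1.6 Base64. A sketch (independent of Spark's own internal use of commons-codec):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SafeBase64 {
    // java.util.Base64 encoders/decoders are immutable; a single shared
    // instance is safe for concurrent worker tasks.
    static final Base64.Encoder ENC = Base64.getEncoder();
    static final Base64.Decoder DEC = Base64.getDecoder();

    static String roundTrip(String s) {
        String encoded = ENC.encodeToString(s.getBytes(StandardCharsets.UTF_8));
        return new String(DEC.decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Hammer the shared encoder/decoder from many threads at once.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> out = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String msg = "task-" + i;
            out.add(pool.submit(() -> roundTrip(msg)));
        }
        for (int i = 0; i < 100; i++) {
            if (!out.get(i).get().equals("task-" + i))
                throw new AssertionError("corrupted round trip");
        }
        pool.shutdown();
        System.out.println("ok");
    }
}
```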