[jira] [Updated] (SPARK-5180) Data source API improvement
[ https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5180: Target Version/s: 1.4.0 (was: 1.3.0) Data source API improvement --- Key: SPARK-5180 URL: https://issues.apache.org/jira/browse/SPARK-5180 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4768: Priority: Blocker (was: Critical) Add Support For Impala Encoded Timestamp (INT96) Key: SPARK-4768 URL: https://issues.apache.org/jira/browse/SPARK-4768 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pat McDonough Priority: Blocker Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, string_timestamp.gz Impala uses INT96 for timestamps. Spark SQL should be able to read this data even though INT96 is not part of the Parquet spec. Perhaps adding a flag to act like Impala when reading Parquet (as we already do for strings) would be useful. Here's an example of the error you might see:
{code}
Caused by: java.lang.RuntimeException: Potential loss of precision: cannot convert INT96
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4768: Assignee: Yin Huai Add Support For Impala Encoded Timestamp (INT96) Key: SPARK-4768 URL: https://issues.apache.org/jira/browse/SPARK-4768 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pat McDonough Assignee: Yin Huai Priority: Blocker Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, string_timestamp.gz Impala uses INT96 for timestamps. Spark SQL should be able to read this data even though INT96 is not part of the Parquet spec. Perhaps adding a flag to act like Impala when reading Parquet (as we already do for strings) would be useful. Here's an example of the error you might see:
{code}
Caused by: java.lang.RuntimeException: Potential loss of precision: cannot convert INT96
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
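For context on the proposed flag: Spark SQL already exposes the string workaround mentioned above as the SQLContext property spark.sql.parquet.binaryAsString. A minimal sketch of what an analogous INT96 switch could look like from user code follows; the spark.sql.parquet.int96AsTimestamp name is an assumption mirroring the existing string flag, not a shipped option.
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`

// Existing flag: treat un-annotated Parquet BINARY columns as strings
// (Impala compatibility).
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Proposed analogue (name hypothetical): decode INT96 columns as timestamps
// instead of failing with "Potential loss of precision: cannot convert INT96".
sqlContext.setConf("spark.sql.parquet.int96AsTimestamp", "true")

val df = sqlContext.parquetFile("hdfs://path/to/impala_table")
{code}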
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3851: Priority: Blocker (was: Critical) Support for reading parquet files with different but compatible schema -- Key: SPARK-3851 URL: https://issues.apache.org/jira/browse/SPARK-3851 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Right now it is required that all of the parquet files have the same schema. It would be nice to support some safe subset of cases where the files' schemas differ. For example:
- Adding and removing nullable columns.
- Widening types (e.g., a column stored as Int in some files and as Long in others).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
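To make the compatible-schema idea concrete, here is a minimal sketch of the kind of merge the issue describes, written against the 1.3-style org.apache.spark.sql.types API; the function name and the restriction to Int-to-Long widening are choices made for this example, not the actual implementation.
{code}
import org.apache.spark.sql.types._

// Merge two Parquet file schemas: take the union of columns (one-sided
// columns become nullable) and widen Int to Long when the two sides disagree.
def mergeSchemas(left: StructType, right: StructType): StructType = {
  val rightByName = right.fields.map(f => f.name -> f).toMap
  val merged = left.fields.map { lf =>
    rightByName.get(lf.name) match {
      case Some(rf) if lf.dataType == rf.dataType => lf
      case Some(rf) =>
        (lf.dataType, rf.dataType) match {
          case (IntegerType, LongType) | (LongType, IntegerType) =>
            lf.copy(dataType = LongType) // safe widening
          case _ =>
            sys.error(s"Incompatible types for column ${lf.name}")
        }
      case None => lf.copy(nullable = true) // column missing in `right`
    }
  }
  // Columns that only exist in `right` are appended as nullable.
  val extras = right.fields
    .filterNot(f => left.fieldNames.contains(f.name))
    .map(_.copy(nullable = true))
  StructType(merged ++ extras)
}
{code}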
[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation
[ https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5425: -- Target Version/s: 1.2.2 I've merged [~jlewandowski]'s patch (https://github.com/apache/spark/pull/4222) to fix this in `master` (1.3.0) and `branch-1.1` (1.1.2), and I've added the {{backport-needed}} tag so we remember to merge it into 1.2.2. ConcurrentModificationException during SparkConf creation - Key: SPARK-5425 URL: https://issues.apache.org/jira/browse/SPARK-5425 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.0, 1.1.2 This fragment of code:
{code}
if (loadDefaults) {
  // Load any spark.* system properties
  for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) {
    settings(k) = v
  }
}
{code}
causes
{noformat}
ERROR 09:43:15 SparkMaster service caused error in state STARTING
java.util.ConcurrentModificationException: null
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) ~[na:1.7.0_60]
at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) ~[scala-library-2.10.4.jar:na]
at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) ~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.4.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.4.jar:na]
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) ~[scala-library-2.10.4.jar:na]
at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
at org.apache.spark.SparkConf.<init>(SparkConf.scala:47) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
{noformat}
when another thread modifies the system properties at the same time. The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related and shows that the problem has also been encountered elsewhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
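One way to avoid the race, sketched under the assumption that snapshotting is acceptable (this is not necessarily what the merged patch does): copy the properties into an immutable map before filtering. Hashtable.clone() is synchronized, so taking the copy is safe against concurrent System.setProperty calls.
{code}
import scala.collection.JavaConverters._

// Iterate over an immutable snapshot instead of the live (concurrently
// mutated) Hashtable view exposed by System.getProperties.asScala.
val snapshot: Map[String, String] =
  System.getProperties.clone().asInstanceOf[java.util.Properties].asScala.toMap
for ((k, v) <- snapshot if k.startsWith("spark.")) {
  settings(k) = v // `settings` as in the SparkConf fragment quoted above
}
{code}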
[jira] [Commented] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values
[ https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302043#comment-14302043 ] Joseph K. Bradley commented on SPARK-5534: -- Note: This is needed for [https://github.com/apache/spark/pull/4047], which is a PR for this JIRA: [https://issues.apache.org/jira/browse/SPARK-1405] EdgeRDD, VertexRDD getStorageLevel return bad values Key: SPARK-5534 URL: https://issues.apache.org/jira/browse/SPARK-5534 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley After caching a graph, its edge and vertex RDDs still return StorageLevel.None. Reproduce error: {code} import org.apache.spark.graphx.{Edge, Graph} val edges = Seq( Edge[Double](0, 1, 0), Edge[Double](1, 2, 0), Edge[Double](2, 3, 0), Edge[Double](3, 4, 0)) val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0) g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None g.cache() g.vertices.count() g.edges.count() g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values
[ https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-5534: Assignee: Joseph K. Bradley EdgeRDD, VertexRDD getStorageLevel return bad values Key: SPARK-5534 URL: https://issues.apache.org/jira/browse/SPARK-5534 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley After caching a graph, its edge and vertex RDDs still return StorageLevel.None. Reproduce error: {code} import org.apache.spark.graphx.{Edge, Graph} val edges = Seq( Edge[Double](0, 1, 0), Edge[Double](1, 2, 0), Edge[Double](2, 3, 0), Edge[Double](3, 4, 0)) val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0) g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None g.cache() g.vertices.count() g.edges.count() g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
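A plausible shape for the fix, assuming the EdgeRDD/VertexRDD implementations wrap an internal partitionsRDD that is what actually gets persisted; this is a sketch of the idea, not the merged patch.
{code}
import org.apache.spark.storage.StorageLevel

// Inside the EdgeRDD/VertexRDD implementation classes: report the storage
// level of the RDD that holds the cached partitions, rather than the level
// of the thin wrapper RDD, which is never persisted directly.
override def getStorageLevel: StorageLevel = partitionsRDD.getStorageLevel
{code}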
[jira] [Commented] (SPARK-5505) ConsumerRebalanceFailedException from Kafka consumer
[ https://issues.apache.org/jira/browse/SPARK-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302062#comment-14302062 ] Tathagata Das commented on SPARK-5505: -- Since this is a problem with the high-level consumer, solving this requires completely rearchitecting the KafkaReceiver. This is hard to do. Possible workarounds: http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E ConsumerRebalanceFailedException from Kafka consumer Key: SPARK-5505 URL: https://issues.apache.org/jira/browse/SPARK-5505 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Environment: CentOS6 / Linux 2.6.32-358.2.1.el6.x86_64 java version "1.7.0_21" Scala compiler version 2.9.3 2 cores Intel(R) Xeon(R) CPU E5620 @ 2.40GHz / 16G RAM VMWare VM. Reporter: Greg Temchenko Priority: Critical From time to time Spark Streaming produces a ConsumerRebalanceFailedException and stops receiving messages. After that, all subsequent RDDs are empty.
{code}
15/01/30 18:18:36 ERROR consumer.ZookeeperConsumerConnector: [terran_vmname-1422670149779-243b4e10], error during syncedRebalance
kafka.common.ConsumerRebalanceFailedException: terran_vmname-1422670149779-243b4e10 can't rebalance after 4 retries
at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355)
{code}
The problem is also described in the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Spark-streaming-consumes-from-Kafka-td19570.html As I understand it, this is a critical blocker for Kafka/Spark Streaming production use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
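Until the receiver is rearchitected, the workaround linked above amounts to giving the high-level consumer more room to rebalance. A sketch of passing the relevant Kafka 0.8 consumer properties through KafkaUtils; the ZooKeeper address, group id, topic, and the StreamingContext `ssc` are placeholders here.
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Raise the rebalance retry count and backoff so the consumer survives
// longer ZooKeeper hiccups; these are standard Kafka 0.8 consumer settings.
val kafkaParams = Map(
  "zookeeper.connect" -> "zkhost:2181",
  "group.id" -> "my-consumer-group",
  "rebalance.max.retries" -> "10",
  "rebalance.backoff.ms" -> "5000",
  "zookeeper.session.timeout.ms" -> "10000")

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
{code}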
[jira] [Updated] (SPARK-5514) collect should call executeCollect
[ https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5514: Assignee: Reynold Xin collect should call executeCollect -- Key: SPARK-5514 URL: https://issues.apache.org/jira/browse/SPARK-5514 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5491) Chi-square feature selection
[ https://issues.apache.org/jira/browse/SPARK-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5491. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 1484 [https://github.com/apache/spark/pull/1484] Chi-square feature selection Key: SPARK-5491 URL: https://issues.apache.org/jira/browse/SPARK-5491 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Alexander Ulanov Fix For: 1.3.0 Implement chi-square feature selection. PR: https://github.com/apache/spark/pull/1484 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
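For reference, a usage sketch of the merged feature based on the API in the pull request; ChiSqSelector expects categorical features in RDD[LabeledPoint] form, and the toy data and an existing SparkContext `sc` are assumptions of this example.
{code}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy labeled data with three categorical features.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0, 1.0))))

// Keep the 2 features most associated with the label (chi-squared test).
val selector = new ChiSqSelector(2)
val model = selector.fit(data)
val filtered = data.map(p => LabeledPoint(p.label, model.transform(p.features)))
{code}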
[jira] [Closed] (SPARK-5437) DriverSuite and SparkSubmitSuite incorrect timeout behavior
[ https://issues.apache.org/jira/browse/SPARK-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5437. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 DriverSuite and SparkSubmitSuite incorrect timeout behavior --- Key: SPARK-5437 URL: https://issues.apache.org/jira/browse/SPARK-5437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.3.0 In DriverSuite, we currently set a timeout of 60 seconds. If after this time the process has not terminated, we leak the process because we never destroy it. In SparkSubmitSuite, we currently do not have a timeout so the test can hang indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
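The intended behavior can be sketched as a wait-with-deadline helper: block on the child process, and destroy it if the deadline passes so the test neither hangs indefinitely nor leaks the process. This is an illustration of the described fix, not the actual test code.
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Wait for the child process, but destroy it when the timeout expires so
// the suite cannot leak a running JVM.
def waitOrKill(process: Process, timeout: Duration): Int = {
  val exitCode = Future(process.waitFor())
  try Await.result(exitCode, timeout)
  catch {
    case _: java.util.concurrent.TimeoutException =>
      process.destroy()
      sys.error(s"Process did not exit within $timeout; destroyed it")
  }
}
{code}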
[jira] [Updated] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5388: - Summary: Provide a stable application submission gateway in standalone cluster mode (was: Provide a stable application submission gateway) Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5388: - Description: The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, should be general enough to potentially support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. was: The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, should be general enough to potentially support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301792#comment-14301792 ] Xiangrui Meng commented on SPARK-5226: -- [~alitouka] Thanks for implementing DBSCAN on top of Spark! I'd recommend registering it as a package on http://spark-packages.org so it is more visible to the community. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib only has k-means right now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4523) Improve handling of serialized schema information
[ https://issues.apache.org/jira/browse/SPARK-4523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4523: Priority: Critical (was: Blocker) Improve handling of serialized schema information - Key: SPARK-4523 URL: https://issues.apache.org/jira/browse/SPARK-4523 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical There are several issues with our current handling of metadata serialization, which is especially troublesome since this is the only place that we persist information directly using Spark SQL. Moving forward we should do the following: - Relax the parsing so that it does not fail when optional fields are missing (i.e. containsNull or metadata) - Include a regression suite that attempts to read old parquet files written by previous versions of Spark SQL. - Provide better warning messages when various forms of parsing fail (I think that it is silent right now which makes tracking down bugs more difficult than it needs to be). - Deprecate (display a warning) when reading data with the old case class schema representation and eventually remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
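On the first item in the list above, the relaxation amounts to defaulting optional fields instead of failing the match. A small sketch using json4s, which Spark SQL's JSON schema support is built on; the helper name and the chosen default are illustrative, not the actual parser code.
{code}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Old writers may omit optional fields such as containsNull; fall back to a
// default instead of throwing, so their schemas still parse.
def containsNull(json: JValue): Boolean = (json \ "containsNull") match {
  case JBool(b) => b
  case JNothing => true // field absent in old schema JSON: assume nullable
  case other    => sys.error(s"Unexpected containsNull value: $other")
}

containsNull(parse("""{"type":"array","elementType":"integer"}""")) // true
{code}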
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3851: Assignee: Cheng Lian Support for reading parquet files with different but compatible schema -- Key: SPARK-3851 URL: https://issues.apache.org/jira/browse/SPARK-3851 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Right now it is required that all of the parquet files have the same schema. It would be nice to support some safe subset of cases where the files' schemas differ. For example:
- Adding and removing nullable columns.
- Widening types (e.g., a column stored as Int in some files and as Long in others).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Priority: Blocker (was: Critical) Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
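A hypothetical reproduction of the TINYINT mismatch, assuming a HiveContext `hiveContext` and a populated `src` table; the table and column names are made up for illustration.
{code}
// The metastore declares the column as TINYINT (ByteType), but the Parquet
// table scan produces Int32 values, so byte-typed downstream expressions
// fail with a ClassCastException.
hiveContext.sql("CREATE TABLE tiny_parquet (flag TINYINT) STORED AS PARQUET")
hiveContext.sql("INSERT OVERWRITE TABLE tiny_parquet SELECT 1 FROM src LIMIT 1")
hiveContext.sql("SELECT flag FROM tiny_parquet WHERE flag = 1").collect()
{code}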
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302027#comment-14302027 ] Apache Spark commented on SPARK-3039: - User 'medale' has created a pull request for this issue: https://github.com/apache/spark/pull/4315 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Fix For: 1.2.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier: avro-mapred for the new Hadoop API uses the classifier "hadoop2", while avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}
the following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}
This error is usually a hint that the old and the new Hadoop APIs were mixed up. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4497) HiveThriftServer2 does not exit properly on failure
[ https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4497: Target Version/s: 1.4.0 (was: 1.3.0) HiveThriftServer2 does not exit properly on failure --- Key: SPARK-4497 URL: https://issues.apache.org/jira/browse/SPARK-4497 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Priority: Critical Start the Thrift server with sbin/start-thriftserver.sh --master ... If there is an error (in my case the namenode is in standby mode), the driver shuts down properly:
14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
but trying to run sbin/start-thriftserver.sh --master ... again results in an error that the Thrift server is already running. ps -aef|grep offendingPID shows:
root 32334 1 0 16:32 ? 00:00:00 /usr/local/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf hive.root.logger=INFO,console
This is problematic since we have a process that tries to restart the driver if it dies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5530) ApplicationMaster can't kill executor when using dynamicAllocation
[ https://issues.apache.org/jira/browse/SPARK-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5530: - Affects Version/s: 1.3.0 ApplicationMaster can't kill executor when using dynamicAllocation -- Key: SPARK-5530 URL: https://issues.apache.org/jira/browse/SPARK-5530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: meiyoula Assignee: meiyoula Priority: Critical Fix For: 1.3.0 The YARN allocator logs "Attempted to kill unknown executor 3", and the executor can't be killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5514) collect should call executeCollect
[ https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301953#comment-14301953 ] Apache Spark commented on SPARK-5514: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4313 collect should call executeCollect -- Key: SPARK-5514 URL: https://issues.apache.org/jira/browse/SPARK-5514 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5501) Write support for the data source API
[ https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5501: Assignee: Yin Huai Write support for the data source API - Key: SPARK-5501 URL: https://issues.apache.org/jira/browse/SPARK-5501 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5463: Assignee: Cheng Lian Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5532: - Description: Deterministic failure:
{code}
import org.apache.spark.mllib.linalg._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
val data = sc.parallelize(Seq((1.0, Vectors.dense(1,2,3)))).toDataFrame("label", "features")
data.repartition(1).saveAsParquetFile("blah")
{code}
If you remove the repartition, then this succeeds. Here's the stack trace:
{code}
15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row
at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row
at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5463: Priority: Blocker (was: Critical) Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5184) Improve the performance of metadata operations
[ https://issues.apache.org/jira/browse/SPARK-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5184. - Resolution: Won't Fix Improve the performance of metadata operations -- Key: SPARK-5184 URL: https://issues.apache.org/jira/browse/SPARK-5184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780 ] Markus Dale edited comment on SPARK-3039 at 2/2/15 8:40 PM:
For me, Spark 1.2.0, whether downloading spark-1.2.0-bin-hadoop2.4.tgz or compiling the source with
{code}
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package
{code}
still had the same problem:
{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
{noformat}
Starting the build with a clean .m2/repository, the repository afterwards contained:
* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want).
It seems that shading these two dependencies into the spark-assembly jar resulted in the error above, at least in the downloaded hadoop2.4 spark bin and in my own build. Running the following (after doing a mvn install and a by-hand copy of all the spark artifacts into my local repo for spark-repl/yarn):
{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}
showed that the culprit was in the Hive project, namely org.spark-project.hive:hive-exec's dependency on 1.7.5:
{noformat}
Building Spark Project Hive 1.2.0
[INFO]
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] | \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}
Editing spark-1.2.0/sql/hive/pom.xml and excluding avro-mapred from hive-exec, then recompiling, fixed the problem, and the resulting dist works well against Avro/Hadoop2 code:
{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
Just the last exclusion added. Will try to do a pull request if that's not already addressed in the latest code.
was (Author: medale):
For me, Spark 1.2.0, whether downloading spark-1.2.0-bin-hadoop2.4.tgz or compiling the source with
{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}
still had the same problem:
{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
{noformat}
Starting the build with a clean .m2/repository, the repository afterwards contained:
* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want).
It seems that shading these two dependencies into the spark-assembly jar resulted in the error above, at least in the downloaded hadoop2.4 spark bin and in my own build. Running the following (after doing a mvn install and a by-hand copy of all the spark artifacts into my local repo for spark-repl/yarn):
{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}
showed that the culprit was in the Hive project, namely org.spark-project.hive:hive-exec's dependency on 1.7.5:
{noformat}
Building Spark Project Hive 1.2.0
[INFO]
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] | \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}
Editing spark-1.2.0/sql/hive/pom.xml and excluding avro-mapred from hive-exec, then recompiling, fixed the problem, and the resulting dist works well against Avro/Hadoop2 code:
{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780 ] Markus Dale commented on SPARK-3039:
For me, Spark 1.2.0, whether downloading spark-1.2.0-bin-hadoop2.4.tgz or compiling the source with
{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}
still had the same problem:
{noformat}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
{noformat}
Starting the build with a clean .m2/repository, the repository afterwards contained:
* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want).
It seems that shading these two dependencies into the spark-assembly jar resulted in the error above, at least in the downloaded hadoop2.4 spark bin and in my own build. Running the following (after doing a mvn install and a by-hand copy of all the spark artifacts into my local repo for spark-repl/yarn):
{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred
{code}
showed that the culprit was in the Hive project, namely org.spark-project.hive:hive-exec's dependency on 1.7.5:
{noformat}
Building Spark Project Hive 1.2.0
[INFO]
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] | \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}
Editing spark-1.2.0/sql/hive/pom.xml and excluding avro-mapred from hive-exec, then recompiling, fixed the problem, and the resulting dist works well against Avro/Hadoop2 code:
{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
Just the last exclusion added. Will try to do a pull request if that's not already addressed in the latest code.
Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Fix For: 1.2.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier: avro-mapred for the new Hadoop API uses the classifier "hadoop2", while avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}
the following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at
[jira] [Assigned] (SPARK-5518) Error messages for plans with invalid AttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5518: --- Assignee: Michael Armbrust Error messages for plans with invalid AttributeReferences - Key: SPARK-5518 URL: https://issues.apache.org/jira/browse/SPARK-5518 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker It is now possible for users to put invalid attribute references into query plans. We should check for this case at the end of analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
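A sketch of such a check, written against the Catalyst plan API as an assumption (the real check may live in the analyzer itself): walk the analyzed plan and flag any operator that references attributes none of its children produce.
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Post-analysis sanity check: every attribute an operator references must be
// supplied by its children. Leaf nodes produce their own output, so only
// non-leaf operators are inspected here.
def checkAttributes(plan: LogicalPlan): Unit = plan.foreach { op =>
  val missing = op.references -- op.inputSet
  if (op.children.nonEmpty && missing.nonEmpty) {
    sys.error(s"Invalid AttributeReferences ${missing.mkString(", ")} in:\n$op")
  }
}
{code}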
[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3267: Target Version/s: 1.4.0 (was: 1.3.0) Deadlock between ScalaReflectionLock and Data type initialization - Key: SPARK-3267 URL: https://issues.apache.org/jira/browse/SPARK-3267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Aaron Davidson Priority: Critical Deadlock here:
{code}
"Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000]
java.lang.Thread.State: RUNNABLE
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
...
{code}
and
{code}
"Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000]
java.lang.Thread.State: RUNNABLE
at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
- locked <0x00064e5d9a48> (a org.apache.spark.sql.catalyst.expressions.Cast)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
[jira] [Updated] (SPARK-5258) Clean up exposed classes in sql.hive package
[ https://issues.apache.org/jira/browse/SPARK-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5258: Priority: Blocker (was: Major) Clean up exposed classes in sql.hive package Key: SPARK-5258 URL: https://issues.apache.org/jira/browse/SPARK-5258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values
Joseph K. Bradley created SPARK-5534: Summary: EdgeRDD, VertexRDD getStorageLevel return bad values Key: SPARK-5534 URL: https://issues.apache.org/jira/browse/SPARK-5534 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley After caching a graph, its edge and vertex RDDs still return StorageLevel.None. Reproduce error: {code} import org.apache.spark.graphx.{Edge, Graph} val edges = Seq( Edge[Double](0, 1, 0), Edge[Double](1, 2, 0), Edge[Double](2, 3, 0), Edge[Double](3, 4, 0)) val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0) g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None g.cache() g.vertices.count() g.edges.count() g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5531) Spark download .tgz file does not get unpacked
DeepakVohra created SPARK-5531: -- Summary: Spark download .tgz file does not get unpacked Key: SPARK-5531 URL: https://issues.apache.org/jira/browse/SPARK-5531 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: Linux Reporter: DeepakVohra The spark-1.2.0-bin-cdh4.tgz file downloaded from http://spark.apache.org/downloads.html does not get unpacked:
tar xvf spark-1.2.0-bin-cdh4.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4811: Target Version/s: 1.4.0 (was: 1.3.0) Custom UDTFs not working in Spark SQL - Key: SPARK-4811 URL: https://issues.apache.org/jira/browse/SPARK-4811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1 Reporter: Saurabh Santhosh Priority: Critical I am using the Thrift server interface to Spark SQL and using beeline to connect to it. I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the following exception when using any custom UDTF. These are the steps I did: *Created a UDTF 'com.x.y.xxx'.* Registered the UDTF using the following query: *create temporary function xxx as 'com.x.y.xxx'* The registration went through without any errors, but when I tried executing the UDTF I got the following error: *java.lang.ClassNotFoundException: xxx* The funny thing is that it's trying to load the function name instead of the function class. The exception is at *line no: 81 in hiveudfs.scala*. I have been at it for quite a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
Joseph K. Bradley created SPARK-5532: Summary: Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT Key: SPARK-5532 URL: https://issues.apache.org/jira/browse/SPARK-5532 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Deterministic failure: {code} import org.apache.spark.mllib.linalg._ import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ val data = sc.parallelize(Seq((1.0, Vectors.dense(1,2,3)))).toDataFrame("label", "features") data.repartition(1).saveAsParquetFile("blah") {code} If you remove the repartition, then this succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
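Given that the save succeeds without the repartition, one possible workaround (a sketch, not a fix) is to repartition the underlying RDD before converting it to a DataFrame, so the Parquet writer never sees a repartitioned DataFrame:
{code}
// Sketch of a workaround: repartition the RDD first, then convert, so that
// saveAsParquetFile operates on a DataFrame that was never repartitioned.
val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3))))
  .repartition(1)
  .toDataFrame("label", "features")
data.saveAsParquetFile("blah")
{code}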
[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4553: Assignee: Cheng Lian query for parquet table with string fields in spark sql hive get binary result -- Key: SPARK-4553 URL: https://issues.apache.org/jira/browse/SPARK-4553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Run: create table test_parquet(key int, value string) stored as parquet; insert into table test_parquet select * from src; select * from test_parquet; The string column comes back as raw byte arrays instead of strings, as follows ... 282 [B@38fda3b 138 [B@1407a24 238 [B@12de6fb 419 [B@6c97695 15 [B@4885067 118 [B@156a8d3 72 [B@65d20dd 90 [B@4c18906 307 [B@60b24cc 19 [B@59cf51b 435 [B@39fdf37 10 [B@4f799d7 277 [B@3950951 273 [B@596bf4b 306 [B@3e91557 224 [B@3781d61 309 [B@2d0d128 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4553: Priority: Blocker (was: Major) query for parquet table with string fields in spark sql hive get binary result -- Key: SPARK-4553 URL: https://issues.apache.org/jira/browse/SPARK-4553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Priority: Blocker Run: create table test_parquet(key int, value string) stored as parquet; insert into table test_parquet select * from src; select * from test_parquet; The string column comes back as raw byte arrays instead of strings, as follows ... 282 [B@38fda3b 138 [B@1407a24 238 [B@12de6fb 419 [B@6c97695 15 [B@4885067 118 [B@156a8d3 72 [B@65d20dd 90 [B@4c18906 307 [B@60b24cc 19 [B@59cf51b 435 [B@39fdf37 10 [B@4f799d7 277 [B@3950951 273 [B@596bf4b 306 [B@3e91557 224 [B@3781d61 309 [B@2d0d128 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
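As a possible workaround while the fix lands, Spark SQL's existing flag for interpreting Parquet binary columns as strings may help (whether it applies to this Hive-written table is an assumption; hiveContext is an assumed HiveContext):
{code}
// Possible workaround: interpret Parquet BINARY columns as strings.
// spark.sql.parquet.binaryAsString is an existing Spark SQL flag; its
// applicability to this particular table is an assumption to verify.
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
hiveContext.sql("select * from test_parquet").collect()
{code}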
[jira] [Closed] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number
[ https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4585. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sandy Ryza Target Version/s: 1.3.0 Spark dynamic executor allocation shouldn't use maxExecutors as initial number -- Key: SPARK-4585 URL: https://issues.apache.org/jira/browse/SPARK-4585 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Chengxiang Li Assignee: Sandy Ryza Fix For: 1.3.0 With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
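For reference, a sketch of the relevant configuration once this lands (the min/max keys come from SPARK-3174; starting from the minimum is the behavior this issue delivers):
{code}
import org.apache.spark.SparkConf

// Sketch: with dynamic allocation enabled on YARN, the application should
// now start near the minimum and scale up toward the maximum as needed.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
{code}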
[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation
[ https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5425: -- Assignee: Jacek Lewandowski ConcurrentModificationException during SparkConf creation - Key: SPARK-5425 URL: https://issues.apache.org/jira/browse/SPARK-5425 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.0, 1.1.2 This fragment of code: {code} if (loadDefaults) { // Load any spark.* system properties for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) { settings(k) = v } } {code} causes {noformat} ERROR 09:43:15 SparkMaster service caused error in state STARTING java.util.ConcurrentModificationException: null at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) ~[na:1.7.0_60] at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) ~[scala-library-2.10.4.jar:na] at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) ~[scala-library-2.10.4.jar:na] at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na] at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.4.jar:na] at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.4.jar:na] at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) ~[scala-library-2.10.4.jar:na] at org.apache.spark.SparkConf.init(SparkConf.scala:53) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT] at org.apache.spark.SparkConf.init(SparkConf.scala:47) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT] {noformat} when another thread modifies the system properties at the same time. The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related and shows that this problem has been encountered elsewhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation
[ https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5425: -- Fix Version/s: 1.1.2 1.3.0 ConcurrentModificationException during SparkConf creation - Key: SPARK-5425 URL: https://issues.apache.org/jira/browse/SPARK-5425 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.0, 1.1.2 This fragment of code: {code} if (loadDefaults) { // Load any spark.* system properties for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) { settings(k) = v } } {code} causes {noformat} ERROR 09:43:15 SparkMaster service caused error in state STARTING java.util.ConcurrentModificationException: null at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) ~[na:1.7.0_60] at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) ~[scala-library-2.10.4.jar:na] at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) ~[scala-library-2.10.4.jar:na] at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na] at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.4.jar:na] at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.4.jar:na] at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) ~[scala-library-2.10.4.jar:na] at org.apache.spark.SparkConf.init(SparkConf.scala:53) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT] at org.apache.spark.SparkConf.init(SparkConf.scala:47) ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT] {noformat} when another thread modifies the system properties at the same time. The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related and shows that this problem has been encountered elsewhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
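One way to avoid the race (a sketch, not necessarily the merged fix) is to snapshot the system properties before iterating, so a concurrent writer cannot invalidate the enumeration:
{code}
import scala.collection.JavaConverters._

// Sketch: Hashtable.clone() is synchronized, so taking a snapshot first is
// safe; iterating over the snapshot cannot throw
// ConcurrentModificationException. `settings` is the map from the fragment above.
val snapshot = System.getProperties.clone().asInstanceOf[java.util.Properties]
for ((k, v) <- snapshot.asScala if k.startsWith("spark.")) {
  settings(k) = v
}
{code}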
[jira] [Updated] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4986: - Priority: Blocker (was: Major) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode -- Key: SPARK-4986 URL: https://issues.apache.org/jira/browse/SPARK-4986 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Jesper Lundgren Priority: Blocker Fix For: 1.3.0 When using the graceful stop API of Spark Streaming in a Spark Standalone cluster, the stop signal never reaches the receivers. I have tested this with Spark 1.2 and Kafka receivers. ReceiverTracker sends a StopReceiver message to ReceiverSupervisorImpl. In local mode, ReceiverSupervisorImpl receives this message, but in Standalone cluster mode the message seems to be lost. (I have modified the code to send my own string message as a stop signal from ReceiverTracker to ReceiverSupervisorImpl, and it works as a workaround.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4986: - Fix Version/s: (was: 1.2.1) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode -- Key: SPARK-4986 URL: https://issues.apache.org/jira/browse/SPARK-4986 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Jesper Lundgren Priority: Blocker Fix For: 1.3.0 When using the graceful stop API of Spark Streaming in a Spark Standalone cluster, the stop signal never reaches the receivers. I have tested this with Spark 1.2 and Kafka receivers. ReceiverTracker sends a StopReceiver message to ReceiverSupervisorImpl. In local mode, ReceiverSupervisorImpl receives this message, but in Standalone cluster mode the message seems to be lost. (I have modified the code to send my own string message as a stop signal from ReceiverTracker to ReceiverSupervisorImpl, and it works as a workaround.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
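For reference, the graceful shutdown entry point being exercised here (standard StreamingContext API; ssc is an assumed StreamingContext):
{code}
// The graceful path under test: stop the StreamingContext, stop the receivers,
// and wait for all received data to be processed before shutting down.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}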
[jira] [Updated] (SPARK-5027) add SVMWithLBFGS interface in MLLIB
[ https://issues.apache.org/jira/browse/SPARK-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5027: - Assignee: zhengbing li add SVMWithLBFGS interface in MLLIB --- Key: SPARK-5027 URL: https://issues.apache.org/jira/browse/SPARK-5027 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: zhengbing li Assignee: zhengbing li Original Estimate: 120h Remaining Estimate: 120h Our team has run comparison tests for ANN; the test results are in https://github.com/apache/spark/pull/1290 We found that the performance of SVM using LBFGS is higher than that of SVM using SGD, so I want to add an SVMWithLBFGS interface to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
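A sketch of what such an interface could wrap: MLlib's low-level optimizer API can already combine the hinge loss with LBFGS, and the proposal would add a convenient front end analogous to SVMWithSGD (this sketch is illustrative, not the proposed API):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{HingeGradient, LBFGS, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: train linear-SVM weights with LBFGS using existing MLlib pieces.
def trainSvmWithLBFGS(data: RDD[LabeledPoint], numFeatures: Int) = {
  val optimizer = new LBFGS(new HingeGradient(), new SquaredL2Updater())
  val labeledVectors = data.map(p => (p.label, p.features))
  optimizer.optimize(labeledVectors, Vectors.zeros(numFeatures))
}
{code}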
[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification
[ https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2206: - Target Version/s: 1.4.0 (was: 1.3.0) Automatically infer the number of classification classes in multiclass classification - Key: SPARK-2206 URL: https://issues.apache.org/jira/browse/SPARK-2206 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde Currently, the user needs to specify the numClassesForClassification parameter explicitly during multiclass classification for decision trees. This feature will automatically infer this information (and possibly class histograms) from the training data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
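A minimal sketch of the inference (assuming labels are encoded as 0-based class indices, per MLlib's LabeledPoint convention):
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: infer the number of classes as (max label + 1) for labels encoded
// as 0.0, 1.0, ..., k-1. Counting labels per class in the same pass would
// also yield the class histograms mentioned above.
def inferNumClasses(data: RDD[LabeledPoint]): Int =
  data.map(_.label).max().toInt + 1
{code}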
[jira] [Updated] (SPARK-5027) add SVMWithLBFGS interface in MLLIB
[ https://issues.apache.org/jira/browse/SPARK-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5027: - Target Version/s: 1.4.0 (was: 1.3.0) add SVMWithLBFGS interface in MLLIB --- Key: SPARK-5027 URL: https://issues.apache.org/jira/browse/SPARK-5027 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: zhengbing li Assignee: zhengbing li Original Estimate: 120h Remaining Estimate: 120h Our team has run comparison tests for ANN; the test results are in https://github.com/apache/spark/pull/1290 We found that the performance of SVM using LBFGS is higher than that of SVM using SGD, so I want to add an SVMWithLBFGS interface to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5536) Wrap the old ALS to use the new ALS implementation.
Xiangrui Meng created SPARK-5536: Summary: Wrap the old ALS to use the new ALS implementation. Key: SPARK-5536 URL: https://issues.apache.org/jira/browse/SPARK-5536 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng The new implementation performs better. We should replace the old one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5542) Decouple publishing, packaging, and tagging in release script
[ https://issues.apache.org/jira/browse/SPARK-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302578#comment-14302578 ] Apache Spark commented on SPARK-5542: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/4319 Decouple publishing, packaging, and tagging in release script - Key: SPARK-5542 URL: https://issues.apache.org/jira/browse/SPARK-5542 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Our release script should make it easy to do these separately. I.e. it should be possible to publish a release from a tag that we already cut. This would help with things such as publishing nightly releases (SPARK-1517). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3883. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3571 [https://github.com/apache/spark/pull/3571] Provide SSL support for Akka and HttpServer based connections - Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jacek Lewandowski Fix For: 1.3.0 Spark uses at least 4 logical communication channels: 1. Control messages - Akka based 2. JARs and other files - Jetty based (HttpServer) 3. Computation results - Java NIO based 4. Web UI - Jetty based The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
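A sketch of how the resulting configuration might look (the spark.ssl.* keys follow the namespace introduced by the PR; exact key names, paths, and passwords here are assumptions to check against the merged docs):
{code}
import org.apache.spark.SparkConf

// Sketch under the spark.ssl.* namespace; paths and passwords are placeholders.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "secret")
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")
  .set("spark.ssl.trustStorePassword", "secret")
{code}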
[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5532: - Assignee: Cheng Lian Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT Key: SPARK-5532 URL: https://issues.apache.org/jira/browse/SPARK-5532 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Cheng Lian Priority: Blocker Deterministic failure: {code} import org.apache.spark.mllib.linalg._ import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ val data = sc.parallelize(Seq((1.0, Vectors.dense(1,2,3)))).toDataFrame("label", "features") data.repartition(1).saveAsParquetFile("blah") {code} If you remove the repartition, then this succeeds. Here's the stack trace: {code} 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at
[jira] [Commented] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[ https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302290#comment-14302290 ] Patrick Wendell commented on SPARK-3778: /cc [~hshreedharan] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn - Key: SPARK-3778 URL: https://issues.apache.org/jira/browse/SPARK-3778 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Blocker The newAPIHadoopRDD routine doesn't properly add the credentials to the conf to be able to access secure hdfs. Note that newAPIHadoopFile does handle these because the org.apache.hadoop.mapreduce.Job automatically adds it for you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
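A hedged workaround suggested by the description: route the configuration through org.apache.hadoop.mapreduce.Job, which attaches the caller's credentials, before handing it to newAPIHadoopRDD (the input path is a placeholder):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// Sketch: Job attaches the current user's credentials to its configuration,
// which is what newAPIHadoopFile gets for free.
val job = Job.getInstance(new Configuration())
FileInputFormat.addInputPath(job, new Path("hdfs:///user/placeholder/input"))
val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
{code}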
[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[ https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3778: --- Priority: Blocker (was: Critical) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn - Key: SPARK-3778 URL: https://issues.apache.org/jira/browse/SPARK-3778 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Blocker The newAPIHadoopRDD routine doesn't properly add the credentials to the conf to be able to access secure hdfs. Note that newAPIHadoopFile does handle these because the org.apache.hadoop.mapreduce.Job automatically adds it for you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4550: --- Target Version/s: 1.4.0 In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
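The Kryo caveat above corresponds to an existing configuration switch; a sketch of disabling reference tracking when the application can guarantee acyclic, unshared records:
{code}
import org.apache.spark.SparkConf

// Sketch: with reference tracking off, Kryo writes self-contained records,
// the precondition for relocating serialized map outputs during the sort.
// Only safe when object graphs have no cycles or shared references.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "false")
{code}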
[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4550: --- Priority: Critical (was: Major) In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5537) Expand user guide for multinomial logistic regression
Xiangrui Meng created SPARK-5537: Summary: Expand user guide for multinomial logistic regression Key: SPARK-5537 URL: https://issues.apache.org/jira/browse/SPARK-5537 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: DB Tsai We probably don't need to work out the math in the user guide. We can point users to Wikipedia for details and focus on the public APIs and how to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4508) Native Date type for SQL92 Date
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4508. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3732 [https://github.com/apache/spark/pull/3732] Native Date type for SQL92 Date --- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 bytes as Long) in catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5540) Hide ALS.solveLeastSquares.
[ https://issues.apache.org/jira/browse/SPARK-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302495#comment-14302495 ] Apache Spark commented on SPARK-5540: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4318 Hide ALS.solveLeastSquares. --- Key: SPARK-5540 URL: https://issues.apache.org/jira/browse/SPARK-5540 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng This method survived the code review and it has been there since v1.1.0. It exposes jblas types. Let's remove it from the public API. I expect that no one calls it directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5513) Add NMF option to the new ALS implementation
[ https://issues.apache.org/jira/browse/SPARK-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5513. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4302 [https://github.com/apache/spark/pull/4302] Add NMF option to the new ALS implementation Key: SPARK-5513 URL: https://issues.apache.org/jira/browse/SPARK-5513 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 Then we can swap spark.mllib's implementation to use the new ALS impl. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5231) History Server shows wrong job submission time.
[ https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5231: -- Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0) History Server shows wrong job submission time. --- Key: SPARK-5231 URL: https://issues.apache.org/jira/browse/SPARK-5231 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Labels: backport-needed Fix For: 1.3.0 History Server doesn't show the correct job submission time. This is because JobProgressListener updates the job submission time every time the onJobStart method is invoked from ReplayListenerBus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5195) when hive table is query with alias the cache data lose effectiveness.
[ https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5195: --- Fix Version/s: (was: 1.2.1) when hive table is query with alias the cache data lose effectiveness. Key: SPARK-5195 URL: https://issues.apache.org/jira/browse/SPARK-5195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: yixiaohua Fix For: 1.3.0 Override MetastoreRelation's sameResult method to compare only the database name and table name. Previously, after cache table t1; select count(*) from t1; reads data from memory, but the query below does not; instead it reads from HDFS: select count(*) from t1 t; Cached data is keyed by the logical plan and matched with sameResult, so when the table is queried with an alias its logical plan is not the same as the plan without the alias. Hence sameResult is modified to compare only the database name and table name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5231) History Server shows wrong job submission time.
[ https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5231: -- Labels: backport-needed (was: ) History Server shows wrong job submission time. --- Key: SPARK-5231 URL: https://issues.apache.org/jira/browse/SPARK-5231 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Labels: backport-needed Fix For: 1.3.0 History Server doesn't show the correct job submission time. This is because JobProgressListener updates the job submission time every time the onJobStart method is invoked from ReplayListenerBus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5454) [SQL] Self join with ArrayType columns problems
[ https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5454: Priority: Blocker (was: Major) [SQL] Self join with ArrayType columns problems --- Key: SPARK-5454 URL: https://issues.apache.org/jira/browse/SPARK-5454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Pierre Borckmans Priority: Blocker Weird behaviour when performing a self join on a table with an ArrayType field (potential bug?). I have set up a minimal non-working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f In a nutshell, if the ArrayType column used for the pivot is created manually in the StructType definition, everything works as expected. However, if the ArrayType pivot column is obtained by a SQL query (be it by using an array wrapper, or a collect_list operator, for instance), then the results are completely off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5514) collect should call executeCollect
[ https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5514. Resolution: Fixed Fix Version/s: 1.3.0 collect should call executeCollect -- Key: SPARK-5514 URL: https://issues.apache.org/jira/browse/SPARK-5514 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4508: --- Fix Version/s: (was: 1.3.0) Native Date type for SQL92 Date --- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 bytes as Long) in catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,
Xiangrui Meng created SPARK-5544: Summary: wholeTextFiles should recognize multiple input paths delimited by , Key: SPARK-5544 URL: https://issues.apache.org/jira/browse/SPARK-5544 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng textFile takes delimited paths in a single path string. wholeTextFiles should behave the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
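A sketch of the proposed behavior, mirroring what textFile already does with a comma-delimited path string (directory names are placeholders):
{code}
// textFile already splits a comma-delimited path string across inputs:
val lines = sc.textFile("/data/dir1,/data/dir2")
// Proposed: wholeTextFiles should accept the same form, returning
// (filename, content) pairs drawn from both directories.
val files = sc.wholeTextFiles("/data/dir1,/data/dir2")
{code}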
[jira] [Resolved] (SPARK-5500) Document that feeding hadoopFile into a shuffle operation will cause problems
[ https://issues.apache.org/jira/browse/SPARK-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5500. Resolution: Fixed Fix Version/s: 1.3.0 Document that feeding hadoopFile into a shuffle operation will cause problems - Key: SPARK-5500 URL: https://issues.apache.org/jira/browse/SPARK-5500 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
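The hazard to be documented is, as background, that Hadoop RecordReaders reuse the same Writable instances across records; a sketch of the safe pattern is to copy each record before any shuffle (the input path is a placeholder):
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext._

// Sketch: materialize an immutable copy of each reused Writable before
// shuffling; otherwise many records may alias the same mutated object.
val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/data/input")
val safe = raw.map { case (k, v) => (k.get, v.toString) } // copies the data
val grouped = safe.groupByKey()                           // now safe to shuffle
{code}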
[jira] [Resolved] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2309. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3833 [https://github.com/apache/spark/pull/3833] Generalize the binary logistic regression into multinomial logistic regression -- Key: SPARK-2309 URL: https://issues.apache.org/jira/browse/SPARK-2309 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Priority: Critical Fix For: 1.3.0 Currently, there is no multi-class classifier in MLlib. Logistic regression can be extended to a multinomial one straightforwardly. The following formula will be implemented: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
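With this merged, a minimal usage sketch (setNumClasses is the multinomial knob; training is an assumed RDD of LabeledPoint with labels 0.0 .. k-1):
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: train a k-class multinomial logistic regression model.
def trainMultinomial(training: RDD[LabeledPoint], k: Int) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(k)
    .run(training)
{code}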
[jira] [Updated] (SPARK-4980) Add decay factors to streaming linear methods
[ https://issues.apache.org/jira/browse/SPARK-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4980: - Target Version/s: 1.4.0 (was: 1.3.0) Add decay factors to streaming linear methods - Key: SPARK-4980 URL: https://issues.apache.org/jira/browse/SPARK-4980 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Priority: Minor Our implementation of streaming k-means uses a decay factor that allows users to control how quickly the model adjusts to new data: whether it treats all data equally, or bases its estimate only on the most recent batch. It is intuitively parameterized, and can be specified in units of either batches or points. We should add a similar decay factor to the streaming linear methods using SGD, including streaming linear regression (currently implemented) and streaming logistic regression (in development). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
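For comparison, the streaming k-means decay factor referenced above (values are illustrative):
{code}
import org.apache.spark.mllib.clustering.StreamingKMeans

// Existing precedent: a decay factor of 1.0 treats all batches equally,
// while smaller values weight recent batches more heavily; the half-life
// can alternatively be specified in units of batches or points.
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(0.5)
  .setRandomCenters(2, 0.0)
{code}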
[jira] [Updated] (SPARK-5520) Make FP-Growth implementation take generic item types
[ https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5520: - Priority: Critical (was: Major) Make FP-Growth implementation take generic item types - Key: SPARK-5520 URL: https://issues.apache.org/jira/browse/SPARK-5520 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Priority: Critical There is no technical restriction on the item types in the FP-Growth implementation. We used String in the first PR for simplicity. Maybe we could make the type generic before 1.3 (and specialize it for Int/Long). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
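A hypothetical sketch of the generic shape being proposed (names mirror the existing classes; the Item type parameter is the proposal, not the current API):
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical: Item replaces the hardcoded String. The ClassTag lets the
// implementation build Array[Item] instances for the frequent itemsets.
class FPGrowth(val minSupport: Double) {
  def run[Item: ClassTag](data: RDD[Array[Item]]): FPGrowthModel[Item] = ???
}
class FPGrowthModel[Item](val freqItemsets: RDD[(Array[Item], Long)])
{code}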
[jira] [Updated] (SPARK-4526) Gradient should be added batch computing interface
[ https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4526: - Assignee: Guoqiang Li Gradient should be added batch computing interface -- Key: SPARK-4526 URL: https://issues.apache.org/jira/browse/SPARK-4526 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li If Gradient supported batch computing, we could use efficient numerical libraries (e.g., BLAS). In some cases, this can improve performance by more than ten times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4526) Gradient should be added batch computing interface
[ https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4526: - Target Version/s: 1.4.0 (was: 1.3.0) Gradient should be added batch computing interface -- Key: SPARK-4526 URL: https://issues.apache.org/jira/browse/SPARK-4526 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li If Gradient supported batch computing, we could use efficient numerical libraries (e.g., BLAS). In some cases, this can improve performance by more than ten times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2309: - Priority: Critical (was: Major) Generalize the binary logistic regression into multinomial logistic regression -- Key: SPARK-2309 URL: https://issues.apache.org/jira/browse/SPARK-2309 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Priority: Critical Currently, there is no multi-class classifier in MLlib. Logistic regression can be extended to a multinomial one straightforwardly. The following formula will be implemented: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5539) User guide for LDA
Xiangrui Meng created SPARK-5539: Summary: User guide for LDA Key: SPARK-5539 URL: https://issues.apache.org/jira/browse/SPARK-5539 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Add a section for LDA in the user guide. We probably don't need to explain the algorithm in detail but can point people to the Wikipedia page. The user guide should focus on public APIs and how to use LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2005) Investigate linux container-based solution
[ https://issues.apache.org/jira/browse/SPARK-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302554#comment-14302554 ] Nicholas Chammas commented on SPARK-2005: - [~mengxr] - Do you mind if I rename this issue to Containerize execution of Spark tests? I'm thinking it would be helpful to convert {{dev/run-tests}} to execute tests within a container so we can run more test configurations in parallel on the same server without worrying about things like port or file collisions. Or were you thinking we should develop a way to deploy full Spark clusters within containers? Investigate linux container-based solution -- Key: SPARK-2005 URL: https://issues.apache.org/jira/browse/SPARK-2005 Project: Spark Issue Type: Sub-task Components: Build Reporter: Xiangrui Meng We can set up a container-based cluster environment and automatically test against a deployment matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5454) [SQL] Self join with ArrayType columns problems
[ https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5454: Target Version/s: 1.3.0 [SQL] Self join with ArrayType columns problems --- Key: SPARK-5454 URL: https://issues.apache.org/jira/browse/SPARK-5454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Pierre Borckmans Priority: Blocker Weird behaviour when performing a self join on a table with an ArrayType field (potential bug?). I have set up a minimal non-working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f In a nutshell, if the ArrayType column used for the pivot is created manually in the StructType definition, everything works as expected. However, if the ArrayType pivot column is obtained by a SQL query (be it by using an array wrapper, or a collect_list operator, for instance), then the results are completely off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values
[ https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5534. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4317 [https://github.com/apache/spark/pull/4317] EdgeRDD, VertexRDD getStorageLevel return bad values Key: SPARK-5534 URL: https://issues.apache.org/jira/browse/SPARK-5534 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 After caching a graph, its edge and vertex RDDs still return StorageLevel.None. Reproduce error: {code} import org.apache.spark.graphx.{Edge, Graph} val edges = Seq( Edge[Double](0, 1, 0), Edge[Double](1, 2, 0), Edge[Double](2, 3, 0), Edge[Double](3, 4, 0)) val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0) g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None g.cache() g.vertices.count() g.edges.count() g.vertices.getStorageLevel // returns value for StorageLevel.None g.edges.getStorageLevel // returns value for StorageLevel.None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1406: - Target Version/s: 1.4.0 (was: 1.3.0) PMML model evaluation support via MLib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Assignee: Vincenzo Selvaggio Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, kmeans.xml It would be useful if spark would provide support the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow to use analytical models that were created with a statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5520) Make FP-Growth implementation take generic item types
[ https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5520: - Assignee: Jacky Li Make FP-Growth implementation take generic item types - Key: SPARK-5520 URL: https://issues.apache.org/jira/browse/SPARK-5520 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Jacky Li Priority: Critical There is no technical restriction on the item types in the FP-Growth implementation. We used String in the first PR for simplicity. Maybe we could make the type generic before 1.3 (and specialize it for Int/Long). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5519) Add user guide for FP-Growth
[ https://issues.apache.org/jira/browse/SPARK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5519: - Assignee: Jacky Li Add user guide for FP-Growth Key: SPARK-5519 URL: https://issues.apache.org/jira/browse/SPARK-5519 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Jacky Li We need to add a section for FP-Growth in the user guide after the FP-Growth PR is merged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML
[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4285: - Target Version/s: 1.4.0 (was: 1.3.0) Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/): ``` def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] ``` The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
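A minimal sketch of the proposed method under simplifying assumptions (dense rows, each column fits in memory; the open question about keeping adjacent columns in the same partition is not addressed):
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Sketch: emit (columnIndex, (rowIndex, value)) entries, then assemble each
// column into a dense vector. Assumes numRows fits in an Int and each
// column fits in a single partition's memory.
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
  val numRows = data.count().toInt
  data.zipWithIndex().flatMap { case (row, rowIdx) =>
    row.toArray.zipWithIndex.map { case (v, colIdx) => (colIdx, (rowIdx.toInt, v)) }
  }.groupByKey().mapValues { entries =>
    val arr = new Array[Double](numRows)
    entries.foreach { case (i, v) => arr(i) = v }
    Vectors.dense(arr)
  }
}
{code}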
[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4550: -- Attachment: SPARK-4550-design-v1.pdf In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3505) Augmenting SparkStreaming updateStateByKey API with timestamp
[ https://issues.apache.org/jira/browse/SPARK-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xi Liu closed SPARK-3505. - Resolution: Won't Fix Close this issue for now. Will re-open later when I find time to work on it. Augmenting SparkStreaming updateStateByKey API with timestamp - Key: SPARK-3505 URL: https://issues.apache.org/jira/browse/SPARK-3505 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Xi Liu Priority: Minor The current updateStateByKey API in Spark Streaming does not expose timestamp to the application. In our use case, the application need to know the batch timestamp to decide whether to keep the state or not. And we do not want to use real system time because we want to decouple the two (because the same code base is used for streaming and offline processing). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5535) Add parameter for storage levels.
Xiangrui Meng created SPARK-5535: Summary: Add parameter for storage levels. Key: SPARK-5535 URL: https://issues.apache.org/jira/browse/SPARK-5535 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Add a special parameter type for storage levels that takes both StorageLevels and their string representation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5538) CachedTableSuite failure due to unpersisting RDDs in a non-blocking way
Cheng Lian created SPARK-5538: - Summary: CachedTableSuite failure due to unpersisting RDDs in a non-blocking way Key: SPARK-5538 URL: https://issues.apache.org/jira/browse/SPARK-5538 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Minor [PR #4173|https://github.com/apache/spark/pull/4173/files#diff-726d84ece1e6f6197b98a5868c881ac7R164] introduced non-blocking unpersisting of RDDs, which in turn introduced a race condition in {{CachedTableSuite}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5540) Hide ALS.solveLeastSquares.
Xiangrui Meng created SPARK-5540: Summary: Hide ALS.solveLeastSquares. Key: SPARK-5540 URL: https://issues.apache.org/jira/browse/SPARK-5540 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng This method survived the code review and it has been there since v1.1.0. It exposes jblas types. Let's remove it from the public API. I expect that no one calls it directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5541) Allow running Maven or SBT in the Spark build
Patrick Wendell created SPARK-5541: -- Summary: Allow running Maven or SBT in the Spark build Key: SPARK-5541 URL: https://issues.apache.org/jira/browse/SPARK-5541 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Nicholas Chammas It would be nice if we had a hook for the spark test scripts to run with Maven in addition to running with SBT. Right now it is difficult for us to test pull requests in maven and we get master build breaks because of it. A simple first step is to modify run-tests to allow building with maven. Then we can add a second PRB that invokes this maven build. I would just add an env var called SPARK_BUILD_TOOL that can be set to sbt or mvn. And make sure the associated logic works in either case. If we don't want to have the fancy SQL only stuff in Maven, that's fine too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5541) Allow running Maven or SBT in run-tests
[ https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5541: --- Summary: Allow running Maven or SBT in run-tests (was: Allow running Maven or SBT in the Spark build) Allow running Maven or SBT in run-tests --- Key: SPARK-5541 URL: https://issues.apache.org/jira/browse/SPARK-5541 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Nicholas Chammas It would be nice if we had a hook for the Spark test scripts to run with Maven in addition to running with SBT. Right now it is difficult for us to test pull requests with Maven, and we get master build breaks because of it. A simple first step is to modify run-tests to allow building with Maven. Then we can add a second PRB that invokes this Maven build. I would just add an env var called SPARK_BUILD_TOOL that can be set to sbt or mvn, and make sure the associated logic works in either case. If we don't want to have the fancy SQL-only stuff in Maven, that's fine too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4508) Native Date type for SQL92 Date
[ https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4508: This has caused several date-related test failures in the master and pull request builds, so I'm reverting it: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26560/testReport/org.apache.spark.sql/ScalaReflectionRelationSuite/query_case_class_RDD/ Native Date type for SQL92 Date --- Key: SPARK-4508 URL: https://issues.apache.org/jira/browse/SPARK-4508 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 Store daysSinceEpoch as an Int (4 bytes) instead of a java.sql.Date (8 bytes, as a Long) in the Catalyst row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
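For illustration, a minimal sketch of the days-since-epoch representation (not the reverted patch itself, and deliberately ignoring the time-zone handling that is the subtle part of such a change):
{code}
// Minimal sketch: a date as days since the Unix epoch fits in an Int (4 bytes)
// rather than a java.sql.Date backed by a Long (8 bytes). Time zones ignored.
import java.util.concurrent.TimeUnit

def toDaysSinceEpoch(date: java.sql.Date): Int =
  TimeUnit.MILLISECONDS.toDays(date.getTime).toInt

def fromDaysSinceEpoch(days: Int): java.sql.Date =
  new java.sql.Date(TimeUnit.DAYS.toMillis(days.toLong))
{code}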
[jira] [Commented] (SPARK-5536) Wrap the old ALS to use the new ALS implementation.
[ https://issues.apache.org/jira/browse/SPARK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302653#comment-14302653 ] Apache Spark commented on SPARK-5536: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4321 Wrap the old ALS to use the new ALS implementation. --- Key: SPARK-5536 URL: https://issues.apache.org/jira/browse/SPARK-5536 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng The new implementation performs better. We should replace the old one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reopened SPARK-4986: -- Graceful shutdown for Spark Streaming does not work in Standalone cluster mode -- Key: SPARK-4986 URL: https://issues.apache.org/jira/browse/SPARK-4986 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Jesper Lundgren Priority: Blocker Fix For: 1.3.0 When using the graceful stop API of Spark Streaming in a Spark Standalone cluster, the stop signal never reaches the receivers. I have tested this with Spark 1.2 and Kafka receivers. ReceiverTracker sends a StopReceiver message to ReceiverSupervisorImpl. In local mode, ReceiverSupervisorImpl receives this message, but in Standalone cluster mode the message seems to be lost. (I have modified the code to send my own string message as a stop signal from ReceiverTracker to ReceiverSupervisorImpl, and it works as a workaround.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
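For reference, the graceful stop entry point being exercised here; with {{stopGracefully = true}}, receivers are supposed to be stopped first and all received data processed before the context shuts down:
{code}
// Graceful shutdown: stop the receivers, drain already-received data, then
// stop the StreamingContext (and here the underlying SparkContext as well).
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}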
[jira] [Updated] (SPARK-4588) Add API for feature attributes
[ https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4588: - Target Version/s: 1.4.0 (was: 1.3.0) Add API for feature attributes -- Key: SPARK-4588 URL: https://issues.apache.org/jira/browse/SPARK-4588 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Feature attributes, e.g., continuous/categorical, feature names, feature dimension, number of categories, and number of nonzeros (support), could be useful for ML algorithms. In SPARK-3569, we added metadata to the schema, which can be used to store feature attributes along with the dataset. We need to provide a wrapper over the Metadata class for ML usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
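A minimal sketch of storing such attributes in column metadata via the Metadata API from SPARK-3569; the key names here are illustrative, not the wrapper this ticket proposes:
{code}
// Illustrative only: attach feature attributes to a column through the
// Metadata API added in SPARK-3569. Key names are made up for this sketch.
import org.apache.spark.sql.types.MetadataBuilder

val featureMeta = new MetadataBuilder()
  .putString("type", "categorical") // continuous vs. categorical
  .putLong("numCategories", 5L)
  .putStringArray("categories", Array("a", "b", "c", "d", "e"))
  .build()
{code}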
[jira] [Commented] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values
[ https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302285#comment-14302285 ] Apache Spark commented on SPARK-5534: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4317 EdgeRDD, VertexRDD getStorageLevel return bad values Key: SPARK-5534 URL: https://issues.apache.org/jira/browse/SPARK-5534 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley After caching a graph, its edge and vertex RDDs still return StorageLevel.None. Reproduce error:
{code}
import org.apache.spark.graphx.{Edge, Graph}

val edges = Seq(
  Edge[Double](0, 1, 0),
  Edge[Double](1, 2, 0),
  Edge[Double](2, 3, 0),
  Edge[Double](3, 4, 0))
val g = Graph.fromEdges[Double, Double](sc.parallelize(edges), 0)
g.vertices.getStorageLevel // returns value for StorageLevel.None
g.edges.getStorageLevel    // returns value for StorageLevel.None
g.cache()
g.vertices.count()
g.edges.count()
g.vertices.getStorageLevel // returns value for StorageLevel.None
g.edges.getStorageLevel    // returns value for StorageLevel.None
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302326#comment-14302326 ] Patrick Wendell commented on SPARK-4550: Yeah, this is a good idea. I don't see why we don't serialize these immediately. In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
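For context, the Kryo reference-tracking prerequisite described above is already controllable through configuration; with tracking disabled, each record's serialized form contains no pointers back into the stream and can be relocated safely by a sort:
{code}
// Disabling Kryo reference tracking (on by default) gives up support for
// shared/cyclic object graphs, but makes each serialized record self-contained.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "false")
{code}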
[jira] [Commented] (SPARK-5131) A typo in configuration doc
[ https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302342#comment-14302342 ] Sean Owen commented on SPARK-5131: -- Sometimes site changes reflect changes that are not in the current stable release, so in general the site is updated with each release. Typos could be fixed directly in the interim. In this case, the site will be updated very shortly for 1.2.1 anyway. A typo in configuration doc --- Key: SPARK-5131 URL: https://issues.apache.org/jira/browse/SPARK-5131 Project: Spark Issue Type: Bug Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5543) Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core
Nathan M created SPARK-5543: --- Summary: Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core Key: SPARK-5543 URL: https://issues.apache.org/jira/browse/SPARK-5543 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Nathan M Priority: Minor There is an unused import in org.apache.spark.util.JsonProtocol.scala: `import org.apache.hadoop.hdfs.web.JsonUtil`. This fails builds with older versions of hadoop-core. In particular, building against mapr3 causes a compile error: [ERROR] /var/lib/jenkins/workspace/cse-Apache-Spark/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:35: object web is not a member of package org.apache.hadoop.hdfs [ERROR] import org.apache.hadoop.hdfs.web.JsonUtil This import is unused. It was introduced in PR #4029 (https://github.com/apache/spark/pull/4029) as part of SPARK-5231. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3883: -- Assignee: Jacek Lewandowski Provide SSL support for Akka and HttpServer based connections - Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.0 Spark uses at least 4 logical communication channels:
1. Control messages - Akka based
2. JARs and other files - Jetty based (HttpServer)
3. Computation results - Java NIO based
4. Web UI - Jetty based
The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
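As a rough sketch of what enabling this could look like from the user side, assuming the feature lands as spark.ssl.* properties (the property names are assumptions and the paths are placeholders, not the ticket's confirmed interface):
{code}
// Assumption-laden sketch: spark.ssl.* property names are presumed, and the
// keystore/truststore paths and passwords below are placeholders.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")     // placeholder path
  .set("spark.ssl.keyStorePassword", "changeit")          // placeholder
  .set("spark.ssl.trustStore", "/path/to/truststore.jks") // placeholder path
  .set("spark.ssl.trustStorePassword", "changeit")        // placeholder
{code}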
[jira] [Commented] (SPARK-5543) Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core
[ https://issues.apache.org/jira/browse/SPARK-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302641#comment-14302641 ] Apache Spark commented on SPARK-5543: - User 'nemccarthy' has created a pull request for this issue: https://github.com/apache/spark/pull/4320 Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core -- Key: SPARK-5543 URL: https://issues.apache.org/jira/browse/SPARK-5543 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Nathan M Priority: Minor Labels: easyfix There is an unused import in org.apache.spark.util.JsonProtocol.scala: `import org.apache.hadoop.hdfs.web.JsonUtil`. This fails builds with older versions of hadoop-core. In particular, building against mapr3 causes a compile error: [ERROR] /var/lib/jenkins/workspace/cse-Apache-Spark/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:35: object web is not a member of package org.apache.hadoop.hdfs [ERROR] import org.apache.hadoop.hdfs.web.JsonUtil This import is unused. It was introduced in PR #4029 (https://github.com/apache/spark/pull/4029) as part of SPARK-5231. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods
[ https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5461. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4253 [https://github.com/apache/spark/pull/4253] Graph should have isCheckpointed, getCheckpointFiles methods Key: SPARK-5461 URL: https://issues.apache.org/jira/browse/SPARK-5461 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 Graph has a checkpoint method but does not have the other helper functionality that RDD has. Proposal:
{code}
/** Return whether this Graph has been checkpointed or not. */
def isCheckpointed: Boolean

/** Gets the names of the files to which this Graph was checkpointed. */
def getCheckpointFiles: Seq[String]
{code}
I need this for [SPARK-1405]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
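A short usage sketch, assuming the new methods mirror RDD checkpointing semantics (the checkpoint is only written once the graph is materialized); the checkpoint directory is a placeholder:
{code}
// Usage sketch under the assumption that Graph checkpointing behaves like
// RDD checkpointing: set a directory, checkpoint, then materialize.
import org.apache.spark.graphx.{Edge, Graph}

sc.setCheckpointDir("/tmp/checkpoints") // placeholder directory
val g = Graph.fromEdges(sc.parallelize(Seq(Edge(0L, 1L, 0), Edge(1L, 2L, 0))), 0)
g.cache()
g.checkpoint()
g.vertices.count() // materialize so the checkpoint is actually written
g.edges.count()
assert(g.isCheckpointed)
println(g.getCheckpointFiles.mkString(", "))
{code}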
[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5532: - Priority: Critical (was: Blocker) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT Key: SPARK-5532 URL: https://issues.apache.org/jira/browse/SPARK-5532 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Cheng Lian Priority: Critical Deterministic failure:
{code}
import org.apache.spark.mllib.linalg._
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
data.repartition(1).saveAsParquetFile("blah")
{code}
If you remove the repartition, then this succeeds. Here's the stack trace: {code} 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, 192.168.1.230): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at
[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302303#comment-14302303 ] Sandy Ryza commented on SPARK-4550: --- Just posted a design doc. Would love to get feedback [~ilikerps] [~matei] [~jerryshao]. In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5541) Allow running Maven or SBT in run-tests
[ https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302525#comment-14302525 ] Nicholas Chammas commented on SPARK-5541: - Dup of SPARK-3355? Allow running Maven or SBT in run-tests --- Key: SPARK-5541 URL: https://issues.apache.org/jira/browse/SPARK-5541 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Nicholas Chammas It would be nice if we had a hook for the Spark test scripts to run with Maven in addition to running with SBT. Right now it is difficult for us to test pull requests with Maven, and we get master build breaks because of it. A simple first step is to modify run-tests to allow building with Maven. Then we can add a second PRB that invokes this Maven build. I would just add an env var called SPARK_BUILD_TOOL that can be set to sbt or mvn, and make sure the associated logic works in either case. If we don't want to have the fancy SQL-only stuff in Maven, that's fine too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org