[jira] [Updated] (SPARK-3469) All TaskCompletionListeners should be called even if some of them fail
[ https://issues.apache.org/jira/browse/SPARK-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3469: --- Summary: All TaskCompletionListeners should be called even if some of them fail (was: Make sure TaskCompletionListeners are called in presence of failures) All TaskCompletionListeners should be called even if some of them fail -- Key: SPARK-3469 URL: https://issues.apache.org/jira/browse/SPARK-3469 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin If there are multiple TaskCompletionListeners, and any one of them misbehaves (e.g. throws an exception), then we will skip executing the rest of them. As we are increasingly relying on TaskCompletionListener for cleaning up of resources, we should make sure they are always called, even if the previous ones fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
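The fix described above amounts to a run-all-then-rethrow loop over the listeners. A minimal Python sketch of the pattern (Spark's actual implementation is Scala inside TaskContext; the function and names here are illustrative only):

```python
# Sketch of the desired behavior, not Spark's actual (Scala) code:
# call every completion listener even when earlier ones raise, and
# surface the first failure only after all listeners have run.
def run_all_listeners(listeners):
    errors = []
    for listener in listeners:
        try:
            listener()
        except Exception as exc:  # a misbehaving listener must not block the rest
            errors.append(exc)
    if errors:
        raise errors[0]  # re-raise once every listener has had its turn

calls = []
listeners = [
    lambda: calls.append("a"),
    lambda: (_ for _ in ()).throw(RuntimeError("boom")),  # misbehaving listener
    lambda: calls.append("c"),
]
try:
    run_all_listeners(listeners)
except RuntimeError:
    pass
# calls == ["a", "c"]: the listener after the failing one still ran
```

Collecting the errors and rethrowing only after the loop is what guarantees resource-cleanup listeners always get their turn.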
[jira] [Updated] (SPARK-3469) All TaskCompletionListeners should be called even if some of them fail
[ https://issues.apache.org/jira/browse/SPARK-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3469: --- Component/s: Spark Core All TaskCompletionListeners should be called even if some of them fail -- Key: SPARK-3469 URL: https://issues.apache.org/jira/browse/SPARK-3469 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin If there are multiple TaskCompletionListeners, and any one of them misbehaves (e.g. throws an exception), then we will skip executing the rest of them. As we are increasingly relying on TaskCompletionListener for cleaning up of resources, we should make sure they are always called, even if the previous ones fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3282) It should support multiple receivers at one socketInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenhong resolved SPARK-3282. - Resolution: Won't Fix It should support multiple receivers at one socketInputDStream --- Key: SPARK-3282 URL: https://issues.apache.org/jira/browse/SPARK-3282 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: shenhong At present, a socketInputDStream supports at most one receiver, which becomes a bottleneck when a large input stream arrives. It should support multiple receivers at one socketInputDStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
Shay Rojansky created SPARK-3470: Summary: Have JavaSparkContext implement Closeable/AutoCloseable Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3395) [SQL] DSL uses incorrect attribute ids after a distinct()
[ https://issues.apache.org/jira/browse/SPARK-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3395. - Resolution: Fixed Fix Version/s: 1.2.0 [SQL] DSL uses incorrect attribute ids after a distinct() - Key: SPARK-3395 URL: https://issues.apache.org/jira/browse/SPARK-3395 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang Assignee: Eric Liang Priority: Minor Fix For: 1.2.0 In the following example, val rdd = ... // two columns: {key, value} val derivedRDD = rdd.distinct().limit(1) sql("explain select * from rdd inner join derivedRDD on rdd.key = derivedRDD.key") The inner join executes incorrectly since the two keys end up with the same attribute id after analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3469) All TaskCompletionListeners should be called even if some of them fail
[ https://issues.apache.org/jira/browse/SPARK-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128149#comment-14128149 ] Apache Spark commented on SPARK-3469: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2343 All TaskCompletionListeners should be called even if some of them fail -- Key: SPARK-3469 URL: https://issues.apache.org/jira/browse/SPARK-3469 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin If there are multiple TaskCompletionListeners, and any one of them misbehaves (e.g. throws an exception), then we will skip executing the rest of them. As we are increasingly relying on TaskCompletionListener for cleaning up of resources, we should make sure they are always called, even if the previous ones fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3471) Automatic resource manager for SparkContext in Scala?
Shay Rojansky created SPARK-3471: Summary: Automatic resource manager for SparkContext in Scala? Key: SPARK-3471 URL: https://issues.apache.org/jira/browse/SPARK-3471 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to add automatic resource management semantics to SparkContext (i.e. the with statement in Python (SPARK-3458), Closeable/AutoCloseable in Java (SPARK-3470)). I have no knowledge of Scala whatsoever, but a quick search seems to indicate that there isn't a standard mechanism for this - someone with real Scala knowledge should take a look and make a decision... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
[ https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Rojansky closed SPARK-2972. Resolution: Won't Fix APPLICATION_COMPLETE not created in Python unless context explicitly stopped Key: SPARK-2972 URL: https://issues.apache.org/jira/browse/SPARK-2972 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Environment: Cloudera 5.1, yarn master on ubuntu precise Reporter: Shay Rojansky If you don't explicitly stop a SparkContext at the end of a Python application with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't get picked up by the history server. This can be easily reproduced with pyspark (but affects scripts as well). The current workaround is to wrap the entire script with a try/finally and stop manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
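The try/finally workaround mentioned above can be packaged once as a context manager, which is the shape of the with-statement support proposed in SPARK-3458. A minimal sketch, using a stand-in class (FakeSparkContext is hypothetical; the real object would be pyspark.SparkContext):

```python
from contextlib import contextmanager

class FakeSparkContext:
    """Stand-in for pyspark.SparkContext, just to show the pattern."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

@contextmanager
def spark_context():
    sc = FakeSparkContext()
    try:
        yield sc
    finally:
        sc.stop()  # runs on normal exit and on exceptions alike

with spark_context() as sc:
    pass  # do work; stop() is guaranteed afterwards
# sc.stopped is now True, even if the body had raised
```

Because the finally clause always runs, the APPLICATION_COMPLETE marker (which depends on stop() being called) would be written even when the script exits via an exception.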
[jira] [Created] (SPARK-3472) Option to take top n elements (unsorted)
Kanwaljit Singh created SPARK-3472: -- Summary: Option to take top n elements (unsorted) Key: SPARK-3472 URL: https://issues.apache.org/jira/browse/SPARK-3472 Project: Spark Issue Type: New Feature Reporter: Kanwaljit Singh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3472) Option to take top n elements (unsorted)
[ https://issues.apache.org/jira/browse/SPARK-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kanwaljit Singh closed SPARK-3472. -- Resolution: Invalid Option to take top n elements (unsorted) Key: SPARK-3472 URL: https://issues.apache.org/jira/browse/SPARK-3472 Project: Spark Issue Type: New Feature Reporter: Kanwaljit Singh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li resolved SPARK-3364. Resolution: Fixed Zip equal-length but unequally-partition Key: SPARK-3364 URL: https://issues.apache.org/jira/browse/SPARK-3364 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Kevin Jung Fix For: 1.1.0 ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD by sc.textFile(path, partitionNumbers) with a physically unbalanced HDFS file. {noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

// expected
x.zip(y).count() // 9
x.zip(y).collect()
// Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

// unexpected
x.zip(z).count() // 7
x.zip(z).collect()
// Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
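The loss happens because zip pairs corresponding partitions element-wise and truncates to the shorter side. A plain-Python model of the report's numbers (lists standing in for partitions; this is an illustration, not ZippedRDD's actual code):

```python
# Plain-Python model of RDD.zip: lists stand in for partitions.
# zip pairs corresponding partitions element-wise, and Python's zip
# (like the flawed per-partition iterator) silently truncates to the
# shorter partition.
def zip_partitioned(x_parts, y_parts):
    return [list(zip(xp, yp)) for xp, yp in zip(x_parts, y_parts)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # 9 elements, partition sizes 3/3/3
y_ok = [[1, 1, 1], [1, 1, 2], [2, 3, 3]]   # same sizes: nothing lost
y_bad = [[1, 1, 1, 1, 1], [2, 2], [3, 3]]  # range-partitioned: sizes 5/2/2

ok = sum(zip_partitioned(x, y_ok), [])
bad = sum(zip_partitioned(x, y_bad), [])
# len(ok) == 9, but len(bad) == 7: two elements vanish, as in the report
```

Per partition the result has min(len(xp), len(yp)) pairs: 3+2+2 = 7, matching the unexpected count above.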
[jira] [Resolved] (SPARK-3345) Do correct parameters for ShuffleFileGroup
[ https://issues.apache.org/jira/browse/SPARK-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li resolved SPARK-3345. Resolution: Fixed Fix Version/s: (was: 1.1.1) Do correct parameters for ShuffleFileGroup -- Key: SPARK-3345 URL: https://issues.apache.org/jira/browse/SPARK-3345 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.2.0 In the method newFileGroup of class FileShuffleBlockManager, the parameters for creating a new ShuffleFileGroup object are in the wrong order. Wrong: new ShuffleFileGroup(fileId, shuffleId, files) Correct: new ShuffleFileGroup(shuffleId, fileId, files) Because in the current code the parameters shuffleId and fileId are not used, this doesn't cause a problem now. However, it should be corrected for readability and to avoid future problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3326) can't access a static variable after init in mapper
[ https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li resolved SPARK-3326. Resolution: Not a Problem can't access a static variable after init in mapper --- Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In the Spark main def, I read some text from HDFS and init this object, and after that call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write output to HDFS, I found the bar variable in the Foo object was null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3473) Expose task status when converting TaskInfo into JSON representation
Kousuke Saruta created SPARK-3473: - Summary: Expose task status when converting TaskInfo into JSON representation Key: SPARK-3473 URL: https://issues.apache.org/jira/browse/SPARK-3473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta When TaskInfo is converted into JSON by JsonProtocol, status is lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3473) Expose task status when converting TaskInfo into JSON representation
[ https://issues.apache.org/jira/browse/SPARK-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta closed SPARK-3473. - Resolution: Won't Fix The task status can be determined from the failed field and finishTime, so I'm closing this. Expose task status when converting TaskInfo into JSON representation Key: SPARK-3473 URL: https://issues.apache.org/jira/browse/SPARK-3473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta When TaskInfo is converted into JSON by JsonProtocol, status is lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128232#comment-14128232 ] Sean Owen commented on SPARK-3470: -- If you implement {{AutoCloseable}}, then Spark will not work on Java 6, since this class does not exist before Java 7. Implementing {{Closeable}} is fine of course. I assume it would just call {{stop()}} Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128285#comment-14128285 ] Shay Rojansky commented on SPARK-3470: -- Good point about AutoCloseable. Yes, the idea is for Closeable to call stop(). I'd submit a PR myself but I don't know any Scala whatsoever... Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3474) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST
Chunjun Xiao created SPARK-3474: --- Summary: Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST Key: SPARK-3474 URL: https://issues.apache.org/jira/browse/SPARK-3474 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1 Reporter: Chunjun Xiao There's some inconsistency regarding the env variable used to specify the spark master host server. In the spark source code (MasterArguments.scala), the env variable is SPARK_MASTER_HOST, while in the shell scripts (e.g., spark-env.sh, start-master.sh) it's named SPARK_MASTER_IP. This will introduce an issue in some cases, e.g., if the spark master is started via service spark-master start, which is built based on the latest bigtop (refer to bigtop/spark-master.svc). In this case, SPARK_MASTER_IP will have no effect. I suggest we change SPARK_MASTER_IP in the shell scripts to SPARK_MASTER_HOST. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3474) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128300#comment-14128300 ] Sean Owen commented on SPARK-3474: -- (You can deprecate but still support old variable names, right? so SPARK_MASTER_IP has the effect of setting new SPARK_MASTER_HOST but generates a warning. You wouldn't want to or need to remove old vars immediately.) Rename the env variable SPARK_MASTER_IP to SPARK_MASTER_HOST Key: SPARK-3474 URL: https://issues.apache.org/jira/browse/SPARK-3474 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1 Reporter: Chunjun Xiao There's some inconsistency regarding the env variable used to specify the spark master host server. In spark source code (MasterArguments.scala), the env variable is SPARK_MASTER_HOST, while in the shell script (e.g., spark-env.sh, start-master.sh), it's named SPARK_MASTER_IP. This will introduce an issue in some case, e.g., if spark master is started via service spark-master start, which is built based on latest bigtop (refer to bigtop/spark-master.svc). In this case, SPARK_MASTER_IP will have no effect. I suggest we change SPARK_MASTER_IP in the shell script to SPARK_MASTER_HOST. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
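The deprecate-but-still-support approach described in the comment can be sketched in a few lines. This is illustrative Python, not Spark's actual (shell-script) code; the function name resolve_master_host and the localhost default are assumptions:

```python
import os
import warnings

# Sketch of the suggested migration: prefer the new SPARK_MASTER_HOST,
# honor the legacy SPARK_MASTER_IP with a deprecation warning, and only
# then fall back to a default.
def resolve_master_host(env=os.environ):
    if "SPARK_MASTER_HOST" in env:
        return env["SPARK_MASTER_HOST"]
    if "SPARK_MASTER_IP" in env:
        warnings.warn("SPARK_MASTER_IP is deprecated; use SPARK_MASTER_HOST")
        return env["SPARK_MASTER_IP"]
    return "localhost"  # assumed default for illustration
```

With this ordering, existing deployments that only set SPARK_MASTER_IP keep working (with a warning), and setting both lets the new variable win.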
[jira] [Commented] (SPARK-3407) Add Date type support
[ https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128421#comment-14128421 ] Apache Spark commented on SPARK-3407: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2344 Add Date type support - Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128149#comment-14128149 ] Cody Koeninger commented on SPARK-3462: --- Created a PR for feedback. https://github.com/apache/spark/pull/2345 Seems to do the right thing locally; will see about testing on a cluster. parquet pushdown for unionAll - Key: SPARK-3462 URL: https://issues.apache.org/jira/browse/SPARK-3462 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Cody Koeninger http://apache-spark-developers-list.1001551.n3.nabble.com/parquet-predicate-projection-pushdown-into-unionAll-td8339.html {noformat}
// single table, pushdown
scala> p.where('age > 40).select('name)
res36: org.apache.spark.sql.SchemaRDD = SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 > 40)]

// union of 2 tables, no pushdown
scala> b.where('age > 40).select('name)
res37: org.apache.spark.sql.SchemaRDD = SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 > 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
  ,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [] ]
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
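The rewrite the PR aims at amounts to pushing a Filter that sits on top of a Union down into each child, so every ParquetTableScan can apply the predicate itself. A toy logical-plan rewrite in Python (class and function names are illustrative, not Catalyst's actual API):

```python
# Toy logical-plan rewrite, not Spark's actual Catalyst code: push a
# Filter sitting on top of a Union down into each child of the union.
class Scan:
    def __init__(self, name):
        self.name = name

class Union:
    def __init__(self, children):
        self.children = children

class Filter:
    def __init__(self, pred, child):
        self.pred, self.child = pred, child

def push_filter_through_union(plan):
    if isinstance(plan, Filter) and isinstance(plan.child, Union):
        # duplicate the predicate onto every child of the union
        return Union([Filter(plan.pred, c) for c in plan.child.children])
    return plan  # leave other plan shapes untouched

plan = Filter("age > 40", Union([Scan("people"), Scan("people2")]))
rewritten = push_filter_through_union(plan)
# rewritten: a Union of two Filters, one per scan, each carrying "age > 40"
```

After the rewrite, each scan node carries the predicate, which is the precondition for pushing it further into the Parquet reader.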
[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128478#comment-14128478 ] Apache Spark commented on SPARK-3462: - User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/2345 parquet pushdown for unionAll - Key: SPARK-3462 URL: https://issues.apache.org/jira/browse/SPARK-3462 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Cody Koeninger http://apache-spark-developers-list.1001551.n3.nabble.com/parquet-predicate-projection-pushdown-into-unionAll-td8339.html {noformat}
// single table, pushdown
scala> p.where('age > 40).select('name)
res36: org.apache.spark.sql.SchemaRDD = SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 > 40)]

// union of 2 tables, no pushdown
scala> b.where('age > 40).select('name)
res37: org.apache.spark.sql.SchemaRDD = SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 > 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
  ,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [] ]
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128552#comment-14128552 ] Apache Spark commented on SPARK-3470: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/2346 Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128682#comment-14128682 ] Cody Koeninger commented on SPARK-3462: --- Tested this on a cluster against unions of 2 and 3 parquet tables, around 2 billion records. Seems like a big performance win - previously, simple queries (e.g. count, approx distinct count of a single column) against a union of 2 tables were taking 5 to 10x as long as a single table. Now it's closer to linear, e.g. 35 secs for one table, 74 for a union of 2, etc. parquet pushdown for unionAll - Key: SPARK-3462 URL: https://issues.apache.org/jira/browse/SPARK-3462 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Cody Koeninger http://apache-spark-developers-list.1001551.n3.nabble.com/parquet-predicate-projection-pushdown-into-unionAll-td8339.html {noformat}
// single table, pushdown
scala> p.where('age > 40).select('name)
res36: org.apache.spark.sql.SchemaRDD = SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 > 40)]

// union of 2 tables, no pushdown
scala> b.where('age > 40).select('name)
res37: org.apache.spark.sql.SchemaRDD = SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 > 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
  ,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [] ]
{noformat} -- This
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3286. -- Resolution: Fixed Fix Version/s: 1.2.0 Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Fix For: 1.2.0 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The spark Application Master starts its web UI at http://host-name:port. When Spark ApplicationMaster registers its URL with Resource Manager , the URL does not contain URI scheme. If the URL scheme is absent, Resource Manager’s web app proxy will use the HTTP Policy of the Resource Manager.(YARN-1553) If the HTTP Policy of the Resource Manager is https, then web app proxy will try to access https://host-name:port. This will result in error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3475) dev/merge_spark_pr.py fails on mac
Thomas Graves created SPARK-3475: Summary: dev/merge_spark_pr.py fails on mac Key: SPARK-3475 URL: https://issues.apache.org/jira/browse/SPARK-3475 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Thomas Graves Commit https://github.com/apache/spark/commit/4f4a9884d9268ba9808744b3d612ac23c75f105a#diff-c321b6c82ebb21d8fd225abea9b7b74c added print statements to the run command. When I try to run on mac it errors out when it hits these print statements. Perhaps there is a workaround, or an issue with my environment. {noformat}
Automatic merge went well; stopped before committing as requested
git log HEAD..PR_TOOL_MERGE_PR_2276 --pretty=format:%an %ae
git log HEAD..PR_TOOL_MERGE_PR_2276 --pretty=format:%h [%an] %s
Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 332, in <module>
    merge_hash = merge_pr(pr_num, target_ref)
  File "./dev/merge_spark_pr.py", line 156, in merge_pr
    run_cmd(['git', 'commit', '--author=%s' % primary_author] + merge_message_flags)
  File "./dev/merge_spark_pr.py", line 77, in run_cmd
    print " ".join(cmd)
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128751#comment-14128751 ] Matthew Farrellee commented on SPARK-3470: -- while you can implement Closeable in java 7+ and use try (Closeable c = new ...) { ... } (at least w/ openjdk 1.8), since spark targets java 7+, why not just use AutoCloseable? Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128764#comment-14128764 ] Sean Owen commented on SPARK-3470: -- Spark retains compatibility with Java 6 on purpose AFAIK. But implementing Closeable is fine and also works with try-with-resources in Java 7, yes. Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { return br.readLine(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data
[ https://issues.apache.org/jira/browse/SPARK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128800#comment-14128800 ] Apache Spark commented on SPARK-1484: - User 'staple' has created a pull request for this issue: https://github.com/apache/spark/pull/2347 MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Matei Zaharia Not sure what the best way to warn is, but even printing to the log is probably fine. We may want to print at the end of the training run as well as the beginning to make it more visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
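One minimal shape for the warning Matei describes is to check the training input's storage level before (and again after) the run and log if it was never cached. The sketch below is pure Python: getStorageLevel() mirrors the real RDD API, but FakeRDD, warn_if_uncached, and the string-valued levels are assumptions made so the example is self-contained:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("mllib")

# Hypothetical stand-in for an RDD; in Spark the check would compare
# rdd.getStorageLevel() against StorageLevel.NONE.
class FakeRDD:
    def __init__(self, storage_level="NONE"):
        self._level = storage_level

    def getStorageLevel(self):
        return self._level

def warn_if_uncached(rdd, algorithm="GradientDescent"):
    """Log (and return True) when iterative training would recompute
    an uncached input on every iteration."""
    if rdd.getStorageLevel() == "NONE":
        log.warning("%s: the input RDD is not cached; each iteration will "
                    "recompute it from scratch. Call rdd.cache() first.",
                    algorithm)
        return True
    return False

assert warn_if_uncached(FakeRDD("NONE")) is True
assert warn_if_uncached(FakeRDD("MEMORY_ONLY")) is False
```

Calling this once before the first iteration and once after the last, as the description suggests, makes the warning hard to miss in a long training log.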
[jira] [Commented] (SPARK-3478) Profile Python tasks stage by stage in worker
[ https://issues.apache.org/jira/browse/SPARK-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129503#comment-14129503 ] Apache Spark commented on SPARK-3478: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2351 Profile Python tasks stage by stage in worker - Key: SPARK-3478 URL: https://issues.apache.org/jira/browse/SPARK-3478 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Python code in the driver is easy for users to profile, but the code that runs in the workers is distributed across the cluster and is not easy to profile. So we need a way to do the profiling in the workers and aggregate all the results together for users. This can also be used to analyze bottlenecks in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
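The aggregation idea can be sketched with only the standard library's cProfile/pstats: each simulated task profiles itself, and the driver merges the per-task Stats into one report. The run_task and busy helpers are invented for illustration; real PySpark would have to ship the profile data back from the worker processes before merging:

```python
import cProfile
import io
import pstats

def run_task(task_fn):
    """Profile one simulated worker task and return its pstats.Stats."""
    prof = cProfile.Profile()
    prof.enable()
    task_fn()
    prof.disable()
    return pstats.Stats(prof)

def busy(n=1000):
    """A stand-in workload for a task body."""
    return sum(i * i for i in range(n))

# Each "task" profiles itself; the driver then merges the per-task stats
# into one aggregate view, roughly what per-stage aggregation would do.
per_task = [run_task(busy) for _ in range(3)]

out = io.StringIO()
merged = pstats.Stats(stream=out)
for s in per_task:
    merged.add(s)

merged.sort_stats("cumulative").print_stats(5)
report = out.getvalue()
assert "function calls" in report
```

Keying such merged Stats objects by stage id would give the per-stage breakdown the issue asks for.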
[jira] [Updated] (SPARK-3480) Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks'
[ https://issues.apache.org/jira/browse/SPARK-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-3480: --- Description: Symptom: Run ./dev/run-tests, which dumps output as follows: SBT_MAVEN_PROFILES_ARGS=-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl [Warn] Java 8 tests will not run because JDK version is 1.8. = Running Apache RAT checks = RAT checks passed. = Running Scala style checks = Scalastyle checks failed at following occurrences: [error] Expected ID character [error] Not a valid command: yarn-alpha [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: yarn-alpha [error] yarn-alpha/scalastyle [error] ^ Possible cause: I checked dev/scalastyle and found that it invokes two tasks, 'yarn-alpha/scalastyle' and 'yarn/scalastyle', like echo -e q\n | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \ scalastyle.txt echo -e q\n | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \ scalastyle.txt From the above error message, sbt seems to reject these because of the '/' separator. The checks run through after I manually changed them to 'yarn-alpha:scalastyle' and 'yarn:scalastyle'. was: Symptom: Run ./dev/run-tests, which dumps output as follows: SBT_MAVEN_PROFILES_ARGS=-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl [Warn] Java 8 tests will not run because JDK version is 1.8. = Running Apache RAT checks = RAT checks passed. 
= Running Scala style checks = Scalastyle checks failed at following occurrences: [error] Expected ID character [error] Not a valid command: yarn-alpha [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: yarn-alpha [error] yarn-alpha/scalastyle [error] ^ Possible cause: I checked dev/scalastyle and found that it invokes two tasks, 'yarn-alpha/scalastyle' and 'yarn/scalastyle', like echo -e q\n | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \ scalastyle.txt # Check style with YARN built too echo -e q\n | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \ scalastyle.txt From the above error message, sbt seems to reject these because of the '/' separator. The checks run through after I manually changed them to 'yarn-alpha:scalastyle' and 'yarn:scalastyle'. Throws out Not a valid command 'yarn-alpha/scalastyle' in dev/scalastyle for sbt build tool during 'Running Scala style checks' --- Key: SPARK-3480 URL: https://issues.apache.org/jira/browse/SPARK-3480 Project: Spark Issue Type: Bug Components: Build Reporter: Yi Zhou Priority: Minor Symptom: Run ./dev/run-tests, which dumps output as follows: SBT_MAVEN_PROFILES_ARGS=-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl [Warn] Java 8 tests will not run because JDK version is 1.8. = Running Apache RAT checks = RAT checks passed. 
= Running Scala style checks = Scalastyle checks failed at following occurrences: [error] Expected ID character [error] Not a valid command: yarn-alpha [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: yarn-alpha [error] yarn-alpha/scalastyle [error] ^ Possible cause: I checked dev/scalastyle and found that it invokes two tasks, 'yarn-alpha/scalastyle' and 'yarn/scalastyle', like echo -e q\n | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \ scalastyle.txt echo -e q\n | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \ scalastyle.txt From the above error
[jira] [Resolved] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3447. - Resolution: Fixed Fix Version/s: 1.2.0 Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{"a": [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at 
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org