[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080539#comment-14080539 ] Carlos Fuertes commented on SPARK-2017: --- Hi, I have implemented under https://github.com/apache/spark/pull/1682 the solution where the data for the tables is served as JSON for tasks under 'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the same underlying problem). The main addition is exposing paths with the JSON data as: /stages/stage/tasks/json/?id=nnn /storage/json /storage/rdd/workers/json?id=nnn /storage/rdd/blocks/json?id=nnn and using JavaScript to build the tables from AJAX requests to those JSON endpoints. This partially solves the responsiveness issue since the data is served asynchronously from the page load. However, since the driver resends all the data on every refresh, with a very large number of tasks it takes longer and longer to send the data as the tasks progress. But at least the Summary table loads much faster, with no need to wait for the whole task table to complete. A better solution would be to stream the data in chunks as they are ready, or to keep a cache of the previous results. I have not explored the latter yet, but the above could be a start to build on. web ui stage page becomes unresponsive when the number of tasks is large Key: SPARK-2017 URL: https://issues.apache.org/jira/browse/SPARK-2017 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Reynold Xin Labels: starter {code} sc.parallelize(1 to 1000000, 1000000).count() {code} The above code creates one million tasks to be executed. The stage detail web UI page takes forever to load (if it ever completes). There are again a few different alternatives: 0. Limit the number of tasks we show. 1. Pagination 2. By default only show the aggregate metrics and failed tasks, and hide the successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
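For illustration only, here is a minimal sketch of the idea behind such a JSON endpoint: serialize the task rows into a JSON array that the page's JavaScript can fetch with AJAX and render into a table. This is not the code from PR 1682; TaskRow and tasksToJson are hypothetical names, and json4s is assumed as the JSON library.
{code}
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

// Hypothetical row shape for the tasks table.
case class TaskRow(taskId: Long, status: String, durationMs: Long)

// Produce the JSON array an endpoint like /stages/stage/tasks/json/?id=nnn could return.
def tasksToJson(tasks: Seq[TaskRow]): String = {
  val rows: List[JValue] = tasks.map { t =>
    (("taskId" -> t.taskId) ~ ("status" -> t.status) ~ ("durationMs" -> t.durationMs)): JValue
  }.toList
  compact(render(JArray(rows)))
}
{code}
The browser-side table can then be rebuilt from this payload on each refresh, which is what keeps the initial page load fast even when the task list is large.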
[jira] [Updated] (SPARK-2766) ScalaReflectionSuite throws an IllegalArgumentException in JDK 6
[ https://issues.apache.org/jira/browse/SPARK-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2766: --- Summary: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6 (was: ScalaReflectionSuite throws an IllegalArgumentException in jdk6) ScalaReflectionSuite throws an IllegalArgumentException in JDK 6 --- Key: SPARK-2766 URL: https://issues.apache.org/jira/browse/SPARK-2766 Project: Spark Issue Type: Bug Reporter: Guoqiang Li Assignee: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080539#comment-14080539 ] Carlos Fuertes edited comment on SPARK-2017 at 7/31/14 6:05 AM: Hi, I have implemented under https://github.com/apache/spark/pull/1682 the solution where the data for the tables is served as JSON for tasks under 'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the same underlying problem). The main addition is exposing paths with the JSON data as: /stages/stage/tasks/json/?id=nnn /storage/json /storage/rdd/workers/json?id=nnn /storage/rdd/blocks/json?id=nnn and using JavaScript to build the tables from AJAX requests to those JSON endpoints. This partially solves the responsiveness issue since the data is served asynchronously from the page load. However, since the driver resends all the data on every refresh, with a very large number of tasks it takes longer and longer to send the data as the tasks progress. But at least the Summary table loads much faster, with no need to wait for the whole task table to complete. A better solution would be to stream the data in chunks as they are ready, or to keep a cache of the previous results. I have not explored the latter yet, but the above could be a start to build on. Let me know how this looks to you. was (Author: carlosfuertes): Hi, I have implemented under https://github.com/apache/spark/pull/1682 the solution where the data for the tables is served as JSON for tasks under 'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the same underlying problem). The main addition is exposing paths with the JSON data as: /stages/stage/tasks/json/?id=nnn /storage/json /storage/rdd/workers/json?id=nnn /storage/rdd/blocks/json?id=nnn and using JavaScript to build the tables from AJAX requests to those JSON endpoints. This partially solves the responsiveness issue since the data is served asynchronously from the page load. However, since the driver resends all the data on every refresh, with a very large number of tasks it takes longer and longer to send the data as the tasks progress. But at least the Summary table loads much faster, with no need to wait for the whole task table to complete. A better solution would be to stream the data in chunks as they are ready, or to keep a cache of the previous results. I have not explored the latter yet, but the above could be a start to build on. web ui stage page becomes unresponsive when the number of tasks is large Key: SPARK-2017 URL: https://issues.apache.org/jira/browse/SPARK-2017 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Reynold Xin Labels: starter {code} sc.parallelize(1 to 1000000, 1000000).count() {code} The above code creates one million tasks to be executed. The stage detail web UI page takes forever to load (if it ever completes). There are again a few different alternatives: 0. Limit the number of tasks we show. 1. Pagination 2. By default only show the aggregate metrics and failed tasks, and hide the successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080542#comment-14080542 ] Carlos Fuertes commented on SPARK-2016: --- I have created a pull request https://github.com/apache/spark/pull/1682 that deals with this issue. The idea follows the discussion in issue SPARK-2017, where the data for the tables is served as JSON and later rendered with JavaScript. See https://issues.apache.org/jira/browse/SPARK-2017 for all the discussion. rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large Key: SPARK-2016 URL: https://issues.apache.org/jira/browse/SPARK-2016 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter Try running {code} sc.parallelize(1 to 1000000, 1000000).cache().count() {code} and open the storage UI for this RDD. It takes forever to load the page. When the number of partitions is very large, I think there are a few alternatives: 0. Only show the top 1000. 1. Pagination 2. Instead of grouping by RDD blocks, group by executors -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2022: --- Target Version/s: 1.1.0 Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread "main" java.lang.NumberFormatException: For input string: "sparkseq003.cloudapp.net" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseGrainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
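The trace above boils down to String.toInt being called on a hostname. A plausible reading (an assumption, not a verified diagnosis) is that the -Dspark.mesos.coarse=true token in the launch command is being passed to CoarseGrainedExecutorBackend as a positional argument, shifting the real arguments by one so the hostname lands where the core count is expected:
{code}
// The backend expects a small integer such as "4" in the cores position, but with the
// arguments shifted it sees the hostname instead, which reproduces the exception above.
val coresArg = "sparkseq003.cloudapp.net"
val cores = coresArg.toInt  // java.lang.NumberFormatException: For input string: "sparkseq003.cloudapp.net"
{code}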
[jira] [Updated] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2633: - Attachment: Spark listener enhancement for Hive on Spark job monitor and statistic.docx I added a doc to collect requirements from the Hive on Spark side, since it may look messy in the comments. We can keep discussing based on this file. support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Currently users can only register a Spark listener with the Scala API; we should add this feature to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
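For context, here is a minimal sketch of the existing Scala-side registration that a Java API would need to mirror; the listener class and the counter it keeps are illustrative only.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Toy listener that counts finished tasks.
class TaskCountListener extends SparkListener {
  @volatile var finishedTasks = 0
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    finishedTasks += 1
  }
}

def register(sc: SparkContext): TaskCountListener = {
  val listener = new TaskCountListener
  sc.addSparkListener(listener)  // registration point currently reachable only from the Scala API
  listener
}
{code}
A Java-friendly equivalent would expose the same registration point (and the listener events) through JavaSparkContext.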
[jira] [Commented] (SPARK-2712) Add a small note that mvn package must happen before test
[ https://issues.apache.org/jira/browse/SPARK-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080579#comment-14080579 ] Patrick Wendell commented on SPARK-2712: I'm a bit confused about the doc request because we already have a section in the maven doc pertaining explicitly to this: http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven Add a small note that mvn package must happen before test - Key: SPARK-2712 URL: https://issues.apache.org/jira/browse/SPARK-2712 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 0.9.1, 1.0.0, 1.1.1 Environment: all Reporter: Stephen Boesch Priority: Trivial Labels: documentation Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h Add to the building-with-maven.md: Requirement: build packages before running tests Tests must be run AFTER the package target has already been executed. The following is an example of a correct (build, test) sequence: mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package mvn -Pyarn -Phadoop-2.3 -Phive test BTW Reynold Xin requested this tiny doc improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2712) Add a small note that mvn package must happen before test
[ https://issues.apache.org/jira/browse/SPARK-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2712: --- Assignee: Stephen Boesch Add a small note that mvn package must happen before test - Key: SPARK-2712 URL: https://issues.apache.org/jira/browse/SPARK-2712 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 0.9.1, 1.0.0, 1.1.1 Environment: all Reporter: Stephen Boesch Assignee: Stephen Boesch Priority: Trivial Labels: documentation Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h Add to the building-with-maven.md: Requirement: build packages before running tests Tests must be run AFTER the package target has already been executed. The following is an example of a correct (build, test) sequence: mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package mvn -Pyarn -Phadoop-2.3 -Phive test BTW Reynold Xin requested this tiny doc improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594 ] Larry Xiao commented on SPARK-2728: --- I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found it is very different now. Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBounds error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
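To make the arithmetic concrete, here is a small sketch of the overflow and the suggested Long cast (example numbers only; this is not the rewritten RangePartitioner code referenced above):
{code}
val sampleLength = 1000000   // stands in for rddSample.length
val partitions   = 20000     // large partition counts trigger the problem
val i            = partitions - 2

// Int arithmetic: 999999 * 19999 exceeds Int.MaxValue and wraps to a negative index.
val badIndex  = (sampleLength - 1) * (i + 1) / partitions

// Widening to Long before the multiplication keeps the intermediate value in range.
val goodIndex = ((sampleLength - 1).toLong * (i + 1) / partitions).toInt
{code}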
[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594 ] Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:23 AM: I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found it is very different now. It still uses Int, but it doesn't have the multiplication like before. Can you test it? was (Author: larryxiao): I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found it is very different now. Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBounds error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594 ] Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:24 AM: I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found it is very different now. It still uses Int, but it doesn't have the multiplication like before. Can you test again? [~huangjs] was (Author: larryxiao): I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found it is very different now. It still uses Int, but it doesn't have the multiplication like before. Can you test it? Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBounds error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2729) Forgot to match Timestamp type in ColumnBuilder
[ https://issues.apache.org/jira/browse/SPARK-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080624#comment-14080624 ] Cheng Lian commented on SPARK-2729: --- Steps to reproduce this bug within {{sbt -Phive hive/console}}: {code} scala> hql("create table dates as select cast('2011-01-01 01:01:01' as timestamp) from src") ... scala> cacheTable("dates") ... scala> hql("select count(*) from dates").collect() ... 14/07/31 16:00:37 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 6) scala.MatchError: 8 (of class java.lang.Integer) at org.apache.spark.sql.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:146) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anonfun$2.apply(InMemoryColumnarTableScan.scala:48) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anonfun$2.apply(InMemoryColumnarTableScan.scala:47) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1.apply(InMemoryColumnarTableScan.scala:47) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1.apply(InMemoryColumnarTableScan.scala:46) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:595) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:595) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:226) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Forgot to match Timestamp type in ColumnBuilder --- Key: SPARK-2729 URL: https://issues.apache.org/jira/browse/SPARK-2729 Project: Spark Issue Type: Bug Components: SQL Reporter: Teng Qiu After SPARK-2710 we can create a table in Spark SQL with ColumnType Timestamp from JDBC. When I try {code} sqlContext.cacheTable("myJdbcTable") {code} and then {code} sqlContext.sql("select count(*) from myJdbcTable") {code} I get this exception: {code} scala.MatchError: 8 (of class java.lang.Integer) at org.apache.spark.sql.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:146) {code} I checked the code at ColumnBuilder.scala:146; it is just missing a match on the Timestamp type id, so it is easy to fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
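A sketch of the kind of one-line addition described: ColumnBuilder.apply matches on the column type id, and the Timestamp case simply needs to be added alongside the existing native types. The case and class names below follow that pattern rather than quoting the actual patch.
{code}
val builder = typeId match {
  // ... existing cases for the other native column types (INT, LONG, STRING, ...) ...
  case TIMESTAMP.typeId => new TimestampColumnBuilder  // the case missing for type id 8
}
{code}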
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080627#comment-14080627 ] Apache Spark commented on SPARK-1812: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1685 Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Since Scala 2.10/2.11 are source compatible, we should be able to cross build for both versions. From what I understand there are basically three things we need to figure out: 1. Have two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't really well supported by Maven since published POMs aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080647#comment-14080647 ] Sean Owen commented on SPARK-2728: -- Yes, this was likely fixed by recent changes to RangePartitioner already. Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the number of partitions is greater than 10362, then Spark will report an ArrayIndexOutOfBounds error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i <- 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Casting rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653 ] Tathagata Das commented on SPARK-2447: -- I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
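As a rough sketch of the shape such a helper could take (this is not Ted's PR; the table and column family names are placeholders, and the classic HBase client API is assumed): open one HBase connection per partition, disable auto-flush for throughput, write the puts, then flush and close.
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    val table = new HTable(HBaseConfiguration.create(), tableName)
    table.setAutoFlush(false)  // batch the writes for higher throughput
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()
    table.close()
  }
}
{code}
The same pattern extends naturally to deletes and increments, and to a streaming foreachRDD.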
[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653 ] Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:39 AM: --- I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. The relevant PR is this: https://github.com/apache/spark/pull/194/files was (Author: tdas): I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653 ] Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:40 AM: --- I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. The relevant PR is this: https://github.com/apache/spark/pull/194/files [~ted.m] was (Author: tdas): I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. The relevant PR is this: https://github.com/apache/spark/pull/194/files Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653 ] Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:41 AM: --- I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. The relevant PR is this: https://github.com/apache/spark/pull/194/files [~ted.m] can you take a look at this PR as well? I think saveAsHBaseFile is a simpler interface that may be worth supporting if there is enough use for it (it assumes that all rows have the same column structure). was (Author: tdas): I took a brief look at SPARK-1127 as well. I think both the PRs have their merits. We should consider consolidating the functionalities that they provide. The relevant PR is this: https://github.com/apache/spark/pull/194/files [~ted.m] Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080657#comment-14080657 ] Raymond Liu commented on SPARK-2468: So, is anyone working on this? zero-copy shuffle network communication --- Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty NIO. -- This message was sent by Atlassian JIRA (v6.2#6252)
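For reference, the zero-copy primitive mentioned in the description is a plain JDK call; the sketch below only illustrates FileChannel.transferTo and is not Spark's shuffle code.
{code}
import java.io.FileInputStream
import java.nio.channels.{FileChannel, WritableByteChannel}

def sendFileZeroCopy(path: String, target: WritableByteChannel): Unit = {
  val channel: FileChannel = new FileInputStream(path).getChannel
  try {
    val size = channel.size()
    var position = 0L
    while (position < size) {
      // transferTo may move fewer bytes than requested, so loop until the file is done
      position += channel.transferTo(position, size - position, target)
    }
  } finally {
    channel.close()
  }
}
{code}
The kernel moves the bytes from the page cache to the socket directly, avoiding the user-space copies and extra JVM buffers described above.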
[jira] [Created] (SPARK-2767) SparkSQL CLI doesn't output an error message if a query fails.
Cheng Hao created SPARK-2767: Summary: SparkSQL CLI doesn't output an error message if a query fails. Key: SPARK-2767 URL: https://issues.apache.org/jira/browse/SPARK-2767 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2767) SparkSQL CLI doesn't output an error message if a query fails.
[ https://issues.apache.org/jira/browse/SPARK-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080660#comment-14080660 ] Apache Spark commented on SPARK-2767: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/1686 SparkSQL CLI doesn't output an error message if a query fails. -- Key: SPARK-2767 URL: https://issues.apache.org/jira/browse/SPARK-2767 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080661#comment-14080661 ] Sean Owen commented on SPARK-2749: -- Thanks! PR merged so this can be closed as fixed. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command vs master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on com.novocode:junit-interface Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel
Sean Owen created SPARK-2768: Summary: Add product, user recommend method to MatrixFactorizationModel Key: SPARK-2768 URL: https://issues.apache.org/jira/browse/SPARK-2768 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Right now, MatrixFactorizationModel can only predict a score for one or more (user,product) tuples. As a comment in the file notes, it would be more useful to expose a recommend method that computes the top N scoring products for a user (or vice versa -- users for a product). -- This message was sent by Atlassian JIRA (v6.2#6252)
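A sketch of what such a recommend method could compute, using only the model's public userFeatures and productFeatures RDDs (this shows the underlying idea, not the API added by any particular PR, and it assumes the user id exists in the model):
{code}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def recommendProducts(model: MatrixFactorizationModel, user: Int, num: Int): Array[(Int, Double)] = {
  val userVec = model.userFeatures.lookup(user).head  // latent factors for this user
  model.productFeatures
    .map { case (product, features) =>
      // score = dot product of the user and product factor vectors
      val score = features.zip(userVec).map { case (a, b) => a * b }.sum
      (product, score)
    }
    .top(num)(Ordering.by[(Int, Double), Double](_._2))  // highest-scoring products first
}
{code}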
[jira] [Commented] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080715#comment-14080715 ] Apache Spark commented on SPARK-2768: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1687 Add product, user recommend method to MatrixFactorizationModel -- Key: SPARK-2768 URL: https://issues.apache.org/jira/browse/SPARK-2768 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Right now, MatrixFactorizationModel can only predict a score for one or more (user,product) tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa -- users for a product). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080729#comment-14080729 ] Jianshi Huang commented on SPARK-2728: -- I see. Thanks for the fix Sean and Larry! Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the partition number are greater than 10362, then spark will report ArrayOutofIndex error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i - 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Cast rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080741#comment-14080741 ] Jianshi Huang commented on SPARK-2728: -- Anyone can test it? I'll close the issue. My build for HDP2.1 couldn't work with YARN... strange. Integer overflow in partition index calculation RangePartitioner Key: SPARK-2728 URL: https://issues.apache.org/jira/browse/SPARK-2728 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Spark 1.0.1 Reporter: Jianshi Huang Labels: easyfix If the partition number are greater than 10362, then spark will report ArrayOutofIndex error. The reason is in the partition index calculation in rangeBounds: #Line: 112 val bounds = new Array[K](partitions - 1) for (i - 0 until partitions - 1) { val index = (rddSample.length - 1) * (i + 1) / partitions bounds(i) = rddSample(index) } Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. Cast rddSample.length - 1 to Long should be enough for a fix? Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2769) Ganglia Support Broken / Not working
Stephen Walsh created SPARK-2769: Summary: Ganglia Support Broken / Not working Key: SPARK-2769 URL: https://issues.apache.org/jira/browse/SPARK-2769 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Linux Red Hat 6.4 on Spark 1.1.0 Reporter: Stephen Walsh Hi all, I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0. No issues there; Spark works fine on Hadoop 2.4.0 and Ganglia (GraphiteSink) is installed. I've added the following to metrics.properties: *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message: java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that it fails when we first try to send a message, but nowhere do I see graphite.connect() being called? https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 The GraphiteReporter builder doesn't call it either when creating the reporter object. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. Regards Steve -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Walsh updated SPARK-2769: - Description: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that the error fails when we first fry to send a message but nowhere do I see graphite.connect() being called? https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 The GraphiteBuilder doesn't call it either when creating the reporter object. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. Regards Steve was: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. 
I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Walsh updated SPARK-2769: - Description: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that the error fails when we first fry to send a message but nowhere do I see graphite.connect() being called? https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 as it seems to fail on the send function.. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77 a with this.writer not initialized the writer.write will fail. The GraphiteBuilder doesn't call it either when creating the reporter object. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. 
Regards Steve was: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Walsh updated SPARK-2769: - Description: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69 Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that the error fails when we first fry to send a message but nowhere do I see graphite.connect() being called? https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 as it seems to fail on the send function.. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77 a with this.writer not initialized the writer.write will fail. The GraphiteBuilder doesn't call it either when creating the reporter object. 
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. Regards Steve was: Hi all, I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0 No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is installed. I've added the following to the metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
[jira] [Commented] (SPARK-2709) Add a tool for certifying Spark API compatibility
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080910#comment-14080910 ] Apache Spark commented on SPARK-2709: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/1688 Add a tool for certifying Spark API compatibility Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the APIs present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking APIs, because anyone can audit their distribution and see that they have removed support for certain APIs. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the Spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public APIs (we might borrow the MiMa string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark APIs and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
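As a sketch of the brainstormed reflective audit (much coarser than MiMa byte code signatures, and with a made-up manifest format): given a map from class names to the method names they are expected to expose, check each class with reflection and report anything missing.
{code}
def missingApis(manifest: Map[String, Set[String]]): Map[String, Set[String]] = {
  manifest.flatMap { case (className, expectedMethods) =>
    val found: Set[String] =
      try Class.forName(className).getMethods.map(_.getName).toSet
      catch { case _: ClassNotFoundException => Set.empty[String] }
    val missing = expectedMethods -- found
    if (missing.isEmpty) None else Some(className -> missing)
  }
}

// Example audit of a couple of well-known entry points:
// missingApis(Map("org.apache.spark.SparkContext" -> Set("textFile", "parallelize")))
{code}
Run against a vendor build via spark-submit, an empty result would mean every manifest entry was found.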
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080935#comment-14080935 ] Apache Spark commented on SPARK-1021: - User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/1689 sortByKey() launches a cluster job when it shouldn't Key: SPARK-1021 URL: https://issues.apache.org/jira/browse/SPARK-1021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0, 0.9.0 Reporter: Andrew Ash Assignee: Mark Hamstra Labels: starter The sortByKey() method is listed as a transformation, not an action, in the documentation. But it launches a cluster job regardless. http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method. https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 Josh Rosen suggests that rangeBounds should be made into a lazy variable: {quote} I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance since it's probably unlikely that the range partitioners are equal by chance. {quote} Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
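For illustration, a heavily simplified sketch of the lazy-val idea from the quote above. This is not the real RangePartitioner; the sampling and search logic are stand-ins, and the real fix also needs to ensure nothing forces rangeBounds before an action runs.
{code}
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical simplified partitioner: the job that samples the keys is deferred until
// the bounds are first needed (i.e. when an action runs), not when sortByKey() is called.
class SketchRangePartitioner[K: Ordering: ClassTag](
    override val numPartitions: Int,
    rdd: RDD[(K, _)]) extends Partitioner {

  private val ordering = implicitly[Ordering[K]]

  private lazy val rangeBounds: Array[K] =
    rdd.map(_._1).sample(withReplacement = false, fraction = 0.01, seed = 42L)
      .collect().sorted.take(numPartitions - 1)

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    val i = rangeBounds.indexWhere(b => ordering.lteq(k, b))
    if (i < 0) rangeBounds.length else i
  }

  // equals() avoids touching rangeBounds so that comparing partitioners never
  // triggers the sampling job; only cheap identifiers are compared here.
  override def equals(other: Any): Boolean = other match {
    case o: SketchRangePartitioner[_] => o.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode(): Int = numPartitions
}
{code}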
[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080953#comment-14080953 ] Apache Spark commented on SPARK-2749: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1690 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command vs master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on com.novocode:junit-interface Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Walsh updated SPARK-2769: - Description: Hi all, I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0. No issues there; Spark works fine on Hadoop 2.4.0 and Ganglia (GraphiteSink) is installed. I've added the following to metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69 Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that the error occurs when we first try to send a message, but nowhere do I see graphite.connect() being called. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 It seems to fail in the send function: https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77 and with this.writer not initialized, the writer.write call will fail. The GraphiteReporter builder doesn't call it either when creating the reporter object. 
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. EDIT: found out where the connect gets called. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153 and this is called from here https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98 which is called from here https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98 but the issue still stands. :/ Regards Steve was: Hi all, I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0. No issues there; Spark works fine on Hadoop 2.4.0 and Ganglia (GraphiteSink) is installed. I've added the following to metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN
[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile
[ https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080969#comment-14080969 ] Apache Spark commented on SPARK-2700: - User 'chutium' has created a pull request for this issue: https://github.com/apache/spark/pull/1691 Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile -- Key: SPARK-2700 URL: https://issues.apache.org/jira/browse/SPARK-2700 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.1 Reporter: Teng Qiu When creating a table in Impala, a hidden folder .impala_insert_staging is created inside the table's folder. If we load such a table using the Spark SQL API sqlContext.parquetFile, this hidden folder causes trouble: Spark tries to read metadata from it, and you will see the exception: {code:borderStyle=solid} Caused by: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging; isDirectory=true; modification_time=1406333729252; access_time=0; owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false} ... ... Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging {code} The Impala side does not consider this their problem: https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete .impala_insert_staging directory after INSERT), so maybe we should filter out these hidden folders/files when reading Parquet tables. -- This message was sent by Atlassian JIRA (v6.2#6252)
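As a rough illustration of the suggested filtering (a sketch against the Hadoop FileSystem API, not the actual Spark SQL patch), the directory scan could simply skip dot-prefixed entries such as .impala_insert_staging before footers are read:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}

// Accept only entries whose names do not start with ".", so hidden staging
// folders such as .impala_insert_staging are never handed to the footer reader.
val hiddenPathFilter = new PathFilter {
  override def accept(p: Path): Boolean = !p.getName.startsWith(".")
}

val conf = new Configuration()
val tableDir = new Path("hdfs://xxx:8020/user/hive/warehouse/parquet_strings")
val fs = tableDir.getFileSystem(conf)
val visibleStatuses: Array[FileStatus] = fs.listStatus(tableDir, hiddenPathFilter)
{code}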
[jira] [Commented] (SPARK-2381) streaming receiver crashed, but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081021#comment-14081021 ] sunsc commented on SPARK-2381: -- Sent a PR here: https://github.com/apache/spark/pull/1693/ streaming receiver crashed, but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job and the receivers don't start normally, the application should stop itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2379) stopReceive in dead loop, cause stackoverflow exception
[ https://issues.apache.org/jira/browse/SPARK-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081033#comment-14081033 ] Apache Spark commented on SPARK-2379: - User 'joyyoj' has created a pull request for this issue: https://github.com/apache/spark/pull/1694 stopReceive in dead loop, cause stackoverflow exception --- Key: SPARK-2379 URL: https://issues.apache.org/jira/browse/SPARK-2379 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: sunsc In streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala, stop will call stopReceiver, and stopReceiver will call stop again if an exception occurs, which makes a dead loop. -- This message was sent by Atlassian JIRA (v6.2#6252)
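A minimal sketch of one way to break such a cycle (hypothetical; the names are made up and this is not the actual ReceiverSupervisor fix): guard the shutdown with a flag so re-entrant calls become no-ops.
{code}
import java.util.concurrent.atomic.AtomicBoolean

class SupervisorSketch {
  private val stopped = new AtomicBoolean(false)

  def stop(message: String, error: Option[Throwable]): Unit = {
    // Only the first caller performs the shutdown; a stopReceiver() failure that
    // calls back into stop() returns immediately instead of recursing forever.
    if (stopped.compareAndSet(false, true)) {
      stopReceiver(message, error)
      // ... release other supervisor resources here ...
    }
  }

  def stopReceiver(message: String, error: Option[Throwable]): Unit = {
    try {
      // ... ask the underlying receiver to shut down ...
    } catch {
      case t: Throwable => stop(s"Error stopping receiver: ${t.getMessage}", Some(t))
    }
  }
}
{code}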
[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working
[ https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Walsh updated SPARK-2769: - Description: Hi all, I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0. No issues there; Spark works fine on Hadoop 2.4.0 and Ganglia (GraphiteSink) is installed. I've added the following to metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=HOSTNAME *.sink.graphite.port=8649 *.sink.graphite.period=1 *.sink.graphite.prefix=aa and I get this error message 14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at java.io.BufferedWriter.flush(BufferedWriter.java:254) at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77) at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254) at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156) at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107) at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) From looking at the code I see the following. val graphite: Graphite = new Graphite(new InetSocketAddress(host, port)) val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry) .convertDurationsTo(TimeUnit.MILLISECONDS) .convertRatesTo(TimeUnit.SECONDS) .prefixedWith(prefix) .build(graphite) https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69 Followed by override def start() { reporter.start(pollPeriod, pollUnit) } I noticed that the error occurs when we first try to send a message, but nowhere do I see graphite.connect() being called. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62 It seems to fail in the send function: https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77 and with this.writer not initialized, the writer.write call will fail. The GraphiteReporter builder doesn't call it either when creating the reporter object. 
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113 Maybe I'm looking in the wrong area and I'm passing in the wrong values - but very little logging has me thinking it is a bug. EDIT: found out where the connect gets called. https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153 and this is called from here https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98 which is called from here https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98 but the issue still stands. :/ Edit 2: my ports are open and listening [root@rtr-dev-spark4 ~]# lsof -i :8649 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME gmond 32173 ganglia 5u IPv4 3480253 0t0 UDP rtr-dev-spark4.ord2012:8649 gmond 32173 ganglia 6u IPv4 3480255 0t0 TCP rtr-dev-spark4.ord2012:8649 (LISTEN) gmond 32173 ganglia 7u IPv4 3480257 0t0 UDP rtr-dev-spark4.ord2012:55523->rtr-dev-spark4.ord2012:8649 Regards Steve was: Hi all, I've built Spark 1.1.0
[jira] [Created] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
Chris Fregly created SPARK-2770: --- Summary: Rename spark-ganglia-lgpl to ganglia-lgpl Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line
Ted Yu created SPARK-2771: - Summary: GenerateMIMAIgnore fails scalastyle check due to long line Key: SPARK-2771 URL: https://issues.apache.org/jira/browse/SPARK-2771 Project: Spark Issue Type: Bug Reporter: Ted Yu Priority: Minor I got the following error building master branch: {code} [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 --- error file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala message=File line length exceeds 100 characters line=118 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml Processed 3 file(s) {code} This is caused by 3rd line below: {code} classSymbol.typeSignature.members.filterNot(x => x.fullName.startsWith("java") || x.fullName.startsWith("scala")) .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || isExperimental(x)).map(_.fullName) ++ {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line
[ https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081050#comment-14081050 ] Sean Owen commented on SPARK-2771: -- Already on it :) https://github.com/apache/spark/pull/1690 GenerateMIMAIgnore fails scalastyle check due to long line -- Key: SPARK-2771 URL: https://issues.apache.org/jira/browse/SPARK-2771 Project: Spark Issue Type: Bug Reporter: Ted Yu Priority: Minor I got the following error building master branch: {code} [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 --- error file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala message=File line length exceeds 100 characters line=118 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml Processed 3 file(s) {code} This is caused by 3rd line below: {code} classSymbol.typeSignature.members.filterNot(x => x.fullName.startsWith("java") || x.fullName.startsWith("scala")) .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || isExperimental(x)).map(_.fullName) ++ {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line
[ https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-2771: -- Attachment: spark-2771-v1.txt Patch v1 shortens line 118 to 100 chars wide. GenerateMIMAIgnore fails scalastyle check due to long line -- Key: SPARK-2771 URL: https://issues.apache.org/jira/browse/SPARK-2771 Project: Spark Issue Type: Bug Reporter: Ted Yu Priority: Minor Attachments: spark-2771-v1.txt I got the following error building master branch: {code} [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 --- error file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala message=File line length exceeds 100 characters line=118 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml Processed 3 file(s) {code} This is caused by 3rd line below: {code} classSymbol.typeSignature.members.filterNot(x => x.fullName.startsWith("java") || x.fullName.startsWith("scala")) .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || isExperimental(x)).map(_.fullName) ++ {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2772) Spark Project SQL fails compilation
Ted Yu created SPARK-2772: - Summary: Spark Project SQL fails compilation Key: SPARK-2772 URL: https://issues.apache.org/jira/browse/SPARK-2772 Project: Spark Issue Type: Bug Reporter: Ted Yu I used the following command: {code} mvn clean -Pyarn -Phive -Phadoop-2.4 -DskipTests package {code} I got: {code} [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:52: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Row, Row, Row](rdd, part) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:65: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Row, Null, Null](rdd, part) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:76: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Null, Row, Row](rdd, partitioner) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala:151: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Boolean, Row, Row](rdd, part) [ERROR]^ [ERROR] four errors found {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations
[ https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2762: - Assignee: Timothy Hunter SparkILoop leaks memory in multi-repl configurations Key: SPARK-2762 URL: https://issues.apache.org/jira/browse/SPARK-2762 Project: Spark Issue Type: Bug Reporter: Timothy Hunter Assignee: Timothy Hunter Priority: Minor When subclassing SparkILoop and instantiating multiple objects, the SparkILoop instances do not get garbage collected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations
[ https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2762. -- Resolution: Fixed Fix Version/s: 1.1.0 SparkILoop leaks memory in multi-repl configurations Key: SPARK-2762 URL: https://issues.apache.org/jira/browse/SPARK-2762 Project: Spark Issue Type: Bug Reporter: Timothy Hunter Assignee: Timothy Hunter Priority: Minor Fix For: 1.1.0 When subclassing SparkILoop and instantiating multiple objects, the SparkILoop instances do not get garbage collected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations
[ https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081154#comment-14081154 ] Matei Zaharia commented on SPARK-2762: -- PR: https://github.com/apache/spark/pull/1674 SparkILoop leaks memory in multi-repl configurations Key: SPARK-2762 URL: https://issues.apache.org/jira/browse/SPARK-2762 Project: Spark Issue Type: Bug Reporter: Timothy Hunter Assignee: Timothy Hunter Priority: Minor Fix For: 1.1.0 When subclassing SparkILoop and instantiating multiple objects, the SparkILoop instances do not get garbage collected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
uncleGen created SPARK-2773: --- Summary: Shuffle:use growth rate to predict if need to spill Key: SPARK-2773 URL: https://issues.apache.org/jira/browse/SPARK-2773 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0, 0.9.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081163#comment-14081163 ] Apache Spark commented on SPARK-2773: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/1696 Shuffle:use growth rate to predict if need to spill --- Key: SPARK-2773 URL: https://issues.apache.org/jira/browse/SPARK-2773 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 0.9.0, 1.0.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081161#comment-14081161 ] uncleGen commented on SPARK-2773: - here is my improvement: https://github.com/apache/spark/pull/1696 Shuffle:use growth rate to predict if need to spill --- Key: SPARK-2773 URL: https://issues.apache.org/jira/browse/SPARK-2773 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 0.9.0, 1.0.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2744) The configuration spark.history.retainedApplications is invalid
[ https://issues.apache.org/jira/browse/SPARK-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081191#comment-14081191 ] Marcelo Vanzin commented on SPARK-2744: --- Sort of. The docs say: {quote} The number of application UIs to retain. {quote} That's still true. Only the configured number of UIs are kept in memory. The newer code will list all available applications, though, while keeping only a few UIs in memory. So it's more useful, since you're able to browse more history. The configuration spark.history.retainedApplications is invalid - Key: SPARK-2744 URL: https://issues.apache.org/jira/browse/SPARK-2744 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Labels: historyserver When I set it in spark-env.sh like this: export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=5678 -Dspark.history.retainedApplications=1", the history server web UI retains more than one application -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2743) Parquet has issues with capital letters and case insensitivity
[ https://issues.apache.org/jira/browse/SPARK-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2743. - Resolution: Fixed Fix Version/s: 1.1.0 Parquet has issues with capital letters and case insensitivity -- Key: SPARK-2743 URL: https://issues.apache.org/jira/browse/SPARK-2743 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2397) Get rid of LocalHiveContext
[ https://issues.apache.org/jira/browse/SPARK-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2397. - Resolution: Fixed Fix Version/s: 1.1.0 Get rid of LocalHiveContext --- Key: SPARK-2397 URL: https://issues.apache.org/jira/browse/SPARK-2397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.1.0 LocalHiveContext is nearly completely redundant with HiveContext. We should consider deprecating it and removing all uses. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2774) Set preferred locations for reduce tasks
Shivaram Venkataraman created SPARK-2774: Summary: Set preferred locations for reduce tasks Key: SPARK-2774 URL: https://issues.apache.org/jira/browse/SPARK-2774 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions 1. When you have a small job in a large cluster it can be useful to co-locate map and reduce tasks to avoid going over the network 2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output. -- This message was sent by Atlassian JIRA (v6.2#6252)
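To make the intuition concrete, here is a tiny hypothetical helper (not the proposed patch): given how many map-output bytes each host would feed a reducer, prefer the hosts holding the largest share.
{code}
// Pick the top-k hosts by pending shuffle bytes as the reducer's preferred locations.
def preferredLocations(bytesByHost: Map[String, Long], topK: Int = 2): Seq[String] =
  bytesByHost.toSeq.sortBy { case (_, bytes) => -bytes }.take(topK).map(_._1)

// With heavy skew toward host-a, the reducer would be scheduled near host-a first.
val locs = preferredLocations(Map(
  "host-a" -> 512L * 1024 * 1024,
  "host-b" -> 64L * 1024 * 1024,
  "host-c" -> 8L * 1024 * 1024))
// locs == Seq("host-a", "host-b")
{code}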
[jira] [Resolved] (SPARK-2028) Let users of HadoopRDD access the partition InputSplits
[ https://issues.apache.org/jira/browse/SPARK-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2028. -- Resolution: Fixed Fix Version/s: 1.1.0 Let users of HadoopRDD access the partition InputSplits --- Key: SPARK-2028 URL: https://issues.apache.org/jira/browse/SPARK-2028 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.1.0 If a user creates a HadoopRDD (e.g., via textFile), there is no way to find out which file it came from, though this information is contained in the InputSplit within the RDD. We should find a way to expose this publicly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2774) Set preferred locations for reduce tasks
[ https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081256#comment-14081256 ] Apache Spark commented on SPARK-2774: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/1697 Set preferred locations for reduce tasks Key: SPARK-2774 URL: https://issues.apache.org/jira/browse/SPARK-2774 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions 1. When you have a small job in a large cluster it can be useful to co-locate map and reduce tasks to avoid going over the network 2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2775) HiveContext does not support dots in column names.
Yin Huai created SPARK-2775: --- Summary: HiveContext does not support dots in column names. Key: SPARK-2775 URL: https://issues.apache.org/jira/browse/SPARK-2775 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai When you try the following snippet in hive/console: {code}
val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": "value2"}"""))
jsonRDD(data).registerAsTable("jt")
hql("select `key.number1` from jt")
{code} You will find the name of key.number1 cannot be resolved. {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'key.number1, tree: Project ['key.number1] LowerCaseSchema Subquery jt SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] at map at JsonRDD.scala:37) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags
[ https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2664. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1665 [https://github.com/apache/spark/pull/1665] Deal with `--conf` options in spark-submit that relate to flags --- Key: SPARK-2664 URL: https://issues.apache.org/jira/browse/SPARK-2664 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker Fix For: 1.1.0 If someone sets a spark conf that relates to an existing flag `--master`, we should set it correctly like we do with the defaults file. Otherwise it can have confusing semantics. I noticed this after merging it, otherwise I would have mentioned it in the review. I think it's as simple as modifying loadDefaults to check the user-supplied options also. We might change it to loadUserProperties since it's no longer just the defaults file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2776) Add normalizeByCol method to mllib.util.MLUtils
Andres Perez created SPARK-2776: --- Summary: Add normalizeByCol method to mllib.util.MLUtils Key: SPARK-2776 URL: https://issues.apache.org/jira/browse/SPARK-2776 Project: Spark Issue Type: New Feature Reporter: Andres Perez Priority: Minor Add the ability to compute the mean and standard deviations of each vector (LabeledPoint) component and normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors where each column has a mean of zero and standard deviation of one. See https://github.com/apache/spark/pull/1698 -- This message was sent by Atlassian JIRA (v6.2#6252)
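A simplified sketch of the idea, using plain Array[Double] rows and a couple of actions to compute the column statistics; this is not the code in the pull request, just an illustration of the transformation.
{code}
import org.apache.spark.rdd.RDD

// Standardize each column to zero mean and unit standard deviation.
def normalizeByCol(data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val n = data.count().toDouble
  val sums = data.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  val means = sums.map(_ / n)
  val sqSums = data
    .map(row => row.zip(means).map { case (x, m) => (x - m) * (x - m) })
    .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  // Guard against zero-variance columns so we never divide by zero.
  val stds = sqSums.map(s => math.sqrt(s / n)).map(s => if (s == 0.0) 1.0 else s)
  data.map(row => row.indices.map(i => (row(i) - means(i)) / stds(i)).toArray)
}
{code}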
[jira] [Updated] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2749: --- Assignee: Sean Owen Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command vs master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on com.novocode:junit-interface Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2749. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1690 [https://github.com/apache/spark/pull/1690] Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command vs master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on com.novocode:junit-interface Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2772) Spark Project SQL fails compilation
[ https://issues.apache.org/jira/browse/SPARK-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved SPARK-2772. --- Resolution: Not a Problem 'mvn install' solves the issue. Spark Project SQL fails compilation --- Key: SPARK-2772 URL: https://issues.apache.org/jira/browse/SPARK-2772 Project: Spark Issue Type: Bug Reporter: Ted Yu I used the following command: {code} mvn clean -Pyarn -Phive -Phadoop-2.4 -DskipTests package {code} I got: {code} [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:52: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Row, Row, Row](rdd, part) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:65: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Row, Null, Null](rdd, part) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:76: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Null, Row, Row](rdd, partitioner) [ERROR]^ [ERROR] /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala:151: wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 4 [ERROR] val shuffled = new ShuffledRDD[Boolean, Row, Row](rdd, part) [ERROR]^ [ERROR] four errors found {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2678) `Spark-submit` overrides user application options
[ https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081302#comment-14081302 ] Apache Spark commented on SPARK-2678: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/1699 `Spark-submit` overrides user application options - Key: SPARK-2678 URL: https://issues.apache.org/jira/browse/SPARK-2678 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Here is an example: {code} ./bin/spark-submit --class Foo some.jar --help {code} Since {{--help}} appears behind the primary resource (i.e. {{some.jar}}), it should be recognized as a user application option. But it's actually overridden by {{spark-submit}} and will show the {{spark-submit}} help message. When directly invoking {{spark-submit}}, the constraints here are: # Options before the primary resource should be recognized as {{spark-submit}} options # Options after the primary resource should be recognized as user application options The tricky part is how to handle scripts like {{spark-shell}} that delegate to {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} options like {{--master}} and user defined application options together. For example, say we'd like to write a new script {{start-thriftserver.sh}} to start the Hive Thrift server, basically we may do this: {code} $SPARK_HOME/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@ {code} Then a user may call this script like: {code} ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf key=value {code} Notice that all options are captured by {{$@}}. If we put it before {{spark-internal}}, they are all recognized as {{spark-submit}} options, thus {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after {{spark-internal}}, they *should* all be recognized as options of {{HiveThriftServer2}}, but because of this bug, {{--master}} is still recognized as a {{spark-submit}} option and leads to the right behavior. Although currently all scripts using {{spark-submit}} work correctly, we should still fix this bug, because it causes option name collisions between {{spark-submit}} and user applications, and every time we add a new option to {{spark-submit}}, some existing user applications may break. However, solving this bug may cause some incompatible changes. The suggested solution here is using {{--}} as a separator between {{spark-submit}} options and user application options. For the Hive Thrift server example above, the user should call it in this way: {code} ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf key=value {code} And {{SparkSubmitArguments}} should be responsible for splitting the two sets of options and passing them correctly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2646. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1547 [https://github.com/apache/spark/pull/1547] log4j initialization not quite compatible with log4j 2.x Key: SPARK-2646 URL: https://issues.apache.org/jira/browse/SPARK-2646 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Sean Owen Priority: Minor Fix For: 1.1.0 The logging code that handles log4j initialization leads to a stack overflow error when used with log4j 2.x, which has just been released. This occurs even if a downstream project has correctly adjusted SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x. Here is the relevant bit of Logging.scala: {code}
private def initializeLogging() {
  // If Log4j is being used, but is not initialized, load a default properties file
  val binder = StaticLoggerBinder.getSingleton
  val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
  val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
  if (!log4jInitialized && usingLog4j) {
    val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
    Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
      case Some(url) =>
        PropertyConfigurator.configure(url)
        log.info(s"Using Spark's default log4j profile: $defaultLogProps")
      case None =>
        System.err.println(s"Spark was unable to load $defaultLogProps")
    }
  }
  Logging.initialized = true
  // Force a call into slf4j to initialize it. Avoids this happening from mutliple threads
  // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
  log
}
{code} The first minor issue is that there is a call to a logger inside this method, which is initializing logging. In this situation, it ends up causing the initialization to be called recursively until the stack overflow. It would be slightly tidier to log this only after Logging.initialized = true. Or not at all. But it's not the root problem, or else, it would not work at all now. The calls to log4j classes here always reference log4j 1.2 no matter what. For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, usingLog4j means "using log4j 1.2" and log4jInitialized means "log4j 1.2 is initialized". usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But, it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop. This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However they're in different packages. For example, if the test included ... and begins with org.slf4j, it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j. Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made. (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA for him. I'll propose a PR too.) -- This message was sent by Atlassian JIRA (v6.2#6252)
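As a hypothetical sketch of the reporter's suggestion (not necessarily the fix that was merged), the check could require that the binding actually be the log4j 1.2 bridge under the org.slf4j package before touching any log4j 1.2 APIs:
{code}
import org.slf4j.impl.StaticLoggerBinder

// Treat only the SLF4J log4j 1.2 bridge (org.slf4j.impl.Log4jLoggerFactory) as "using log4j".
// The log4j 2.x binding lives in org.apache.logging.slf4j, so it no longer matches and the
// default-properties initialization is skipped entirely.
val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
val usingLog4j12 = binderClass.startsWith("org.slf4j") && binderClass.endsWith("Log4jLoggerFactory")
{code}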
[jira] [Resolved] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2511. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1671 [https://github.com/apache/spark/pull/1671] Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081357#comment-14081357 ] Reynold Xin commented on SPARK-2468: It's something I'd like to prototype for 1.2. Do you have any thoughts on this? zero-copy shuffle network communication --- Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty NIO. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-2756: - Description: 3 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left children had 0 weight. These are incorrect for multiclass classification. Bug 3: Off-by-1 when finding thresholds for splits for continuous features. * When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values. This can cause problems for small datasets. was: 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left children had 0 weight. These are incorrect for multiclass classification. Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 3 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left children had 0 weight. These are incorrect for multiclass classification. Bug 3: Off-by-1 when finding thresholds for splits for continuous features. * When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values. This can cause problems for small datasets. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2777) ALS factors should be persisted in memory and disk
[ https://issues.apache.org/jira/browse/SPARK-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081401#comment-14081401 ] Apache Spark commented on SPARK-2777: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/1700 ALS factors should be persisted in memory and disk Key: SPARK-2777 URL: https://issues.apache.org/jira/browse/SPARK-2777 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Now the factors are persisted in memory only. If they get evicted by later jobs, we have to start the computation from the very beginning. A better solution is to change the storage level to MEMORY_AND_DISK. -- This message was sent by Atlassian JIRA (v6.2#6252)
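For illustration only (this is not the ALS code itself), the proposed change amounts to using a different storage level on the factor RDDs:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Factors persisted with MEMORY_AND_DISK spill to local disk when evicted from memory,
// so later iterations reload them instead of recomputing the lineage from scratch.
val sc = new SparkContext("local[2]", "als-persist-sketch")
val factors = sc.parallelize(0 until 1000).map(id => (id, Array.fill(10)(math.random)))
factors.persist(StorageLevel.MEMORY_AND_DISK)
factors.count()  // materialize once; subsequent uses hit memory or disk
{code}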
[jira] [Updated] (SPARK-2272) Feature scaling which standardizes the range of independent variables or features of data.
[ https://issues.apache.org/jira/browse/SPARK-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2272: - Assignee: DB Tsai Feature scaling which standardizes the range of independent variables or features of data. -- Key: SPARK-2272 URL: https://issues.apache.org/jira/browse/SPARK-2272 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. In this work, a trait called `VectorTransformer` is defined for generic transformation of a vector. It contains two methods, `apply` which applies a transformation to a vector and `unapply` which applies the inverse transformation to a vector. There are three concrete implementations of `VectorTransformer`, and they all can be easily extended with PMML transformation support. 1) `VectorStandardizer` - Standardises a vector given the mean and variance. Since the standardization will densify the output, the output is always in dense vector format. 2) `VectorRescaler` - Rescales a vector into a target range specified by a tuple of two double values or two vectors as the new target minimum and maximum. Since the rescaling will subtract the minimum of each column first, the output will always be a dense vector regardless of the input vector type. 3) `VectorDivider` - Transforms a vector by dividing by a constant or dividing by a vector on an element-by-element basis. This transformation will preserve the type of the input vector without densifying the result. Utility helper methods are implemented for taking an input of RDD[Vector], and then the transformed RDD[Vector] and the transformer are returned for dividing, rescaling, normalization, and standardization. -- This message was sent by Atlassian JIRA (v6.2#6252)
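A bare-bones sketch of the trait and the standardizer described above (names follow the description; the eventual MLlib API may differ):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

trait VectorTransformer extends Serializable {
  def apply(v: Vector): Vector    // forward transformation
  def unapply(v: Vector): Vector  // inverse transformation
}

// Standardize to zero mean / unit variance; the output is dense, as the description notes.
// Assumes non-zero variance in every component.
class VectorStandardizer(mean: Vector, variance: Vector) extends VectorTransformer {
  override def apply(v: Vector): Vector =
    Vectors.dense(Array.tabulate(v.size)(i => (v(i) - mean(i)) / math.sqrt(variance(i))))
  override def unapply(v: Vector): Vector =
    Vectors.dense(Array.tabulate(v.size)(i => v(i) * math.sqrt(variance(i)) + mean(i)))
}
{code}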
[jira] [Closed] (SPARK-2776) Add normalizeByCol method to mllib.util.MLUtils
[ https://issues.apache.org/jira/browse/SPARK-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2776. Resolution: Duplicate Add normalizeByCol method to mllib.util.MLUtils --- Key: SPARK-2776 URL: https://issues.apache.org/jira/browse/SPARK-2776 Project: Spark Issue Type: New Feature Reporter: Andres Perez Priority: Minor Add the ability to compute the mean and standard deviations of each vector (LabeledPoint) component and normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors where each column has a mean of zero and standard deviation of one. See https://github.com/apache/spark/pull/1698 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081465#comment-14081465 ] Marcelo Vanzin commented on SPARK-1537: --- I'm working on this but this all sort of depends on progress being made on the Yarn side, so at this moment I'm not yet ready to send any PRs. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line
[ https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081490#comment-14081490 ] Marcelo Vanzin commented on SPARK-1576: --- With Sandy's recent patch (https://github.com/apache/spark/pull/1253) this should be easy to do (spark-submit --conf spark.executor.extraJavaOptions=blah). Passing of JAVA_OPTS to YARN on command line Key: SPARK-1576 URL: https://issues.apache.org/jira/browse/SPARK-1576 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Nishkam Ravi Attachments: SPARK-1576.patch JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It would be good to allow the user to pass them on command line as well to restrict scope to single application invocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1131) Better document the --args option for yarn-standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081497#comment-14081497 ] Marcelo Vanzin commented on SPARK-1131: --- This is probably obsolete now with spark-submit. Better document the --args option for yarn-standalone mode -- Key: SPARK-1131 URL: https://issues.apache.org/jira/browse/SPARK-1131 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Pérez González Assignee: Karthik Kambatla It took me a while to figure out that the correct way to use it with multiple arguments was to include the option multiple times. I.e. --args arg1 --args arg2 instead of --args arg1 arg2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2778) Add unit tests for Yarn integration
Marcelo Vanzin created SPARK-2778: - Summary: Add unit tests for Yarn integration Key: SPARK-2778 URL: https://issues.apache.org/jira/browse/SPARK-2778 Project: Spark Issue Type: Test Components: YARN Reporter: Marcelo Vanzin It would be nice to add some Yarn integration tests to the unit tests in Spark; Yarn provides a MiniYARNCluster class that can be used to spawn a cluster locally. -- This message was sent by Atlassian JIRA (v6.2#6252)
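A minimal sketch of what such a test fixture could look like, assuming Hadoop's MiniYARNCluster API; the constructor arity is an assumption and the Spark-side assertions are left out, so the eventual test suite may be structured quite differently:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

object MiniYarnClusterSketch {
  def main(args: Array[String]): Unit = {
    // testName, numNodeManagers, numLocalDirs, numLogDirs (assumed constructor arity)
    val cluster = new MiniYARNCluster("spark-yarn-it", 1, 1, 1)
    cluster.init(new YarnConfiguration())
    cluster.start()
    try {
      val conf = cluster.getConfig
      // A real test would submit a Spark application against `conf` here
      // (e.g. in yarn-cluster mode) and assert on the final application status.
      println("ResourceManager bound at " + conf.get(YarnConfiguration.RM_ADDRESS))
    } finally {
      cluster.stop()
    }
  }
}
{code}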
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081514#comment-14081514 ] Ted Malaska commented on SPARK-2447: Tell me if I'm wrong, but the core offering of 1127 is also provided by 2447. All I would have to do is provide the RDD functions to call the bulkPut or a future bulkPartitionPut. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
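For context, a hedged sketch of the kind of bulk-put helper being discussed; this is not the API from the pull request. It assumes the HBase 0.98-era client and opens one table handle per partition with auto-flush disabled for throughput; table, column family, and qualifier names are placeholders:
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    // One connection per partition; HBase 0.98-era API.
    val table = new HTable(HBaseConfiguration.create(), tableName)
    table.setAutoFlush(false, false) // buffer puts client-side for throughput
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value)) // placeholders
      table.put(put)
    }
    table.flushCommits()
    table.close()
  }
}
{code}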
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081524#comment-14081524 ] Tathagata Das commented on SPARK-2447: -- Exactly!! That's why I feel that both have their merits. 2447 provides lower-level, all-inclusive interfaces with which slightly advanced users can do arbitrary stuff, but it requires programming against HBase types like Put and so on. However, 1127 provides the simple interface which allows not-so-advanced users to do a set of simple things without requiring too much HBase knowledge. They are complementary, and the latter should be implemented on top of the former. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types
[ https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2063: Target Version/s: 1.2.0 (was: 1.1.0) Creating a SchemaRDD via sql() does not correctly resolve nested types -- Key: SPARK-2063 URL: https://issues.apache.org/jira/browse/SPARK-2063 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Cheng Lian For example, from the typical twitter dataset: {code} scala val popularTweets = sql(SELECT retweeted_status.text, MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30) scala popularTweets.toString 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'retweeted_status.text at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at 
scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217) at
[jira] [Updated] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2096: Target Version/s: 1.2.0 (was: 1.1.0) Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and the returns those values as an array. But, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
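To make the two access patterns concrete, assuming a registered table t with an arrayOfStruct column as in the description, and that sql comes from a SQLContext as in the other snippets here:
{code}
// Table name `t` is a placeholder; arrayOfStruct and field1 come from the issue text.
sql("SELECT arrayOfStruct[0].field1 FROM t")  // field1 of the first element
sql("SELECT arrayOfStruct.field1 FROM t")     // all field1 values, returned as an array
{code}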
[jira] [Resolved] (SPARK-2740) In JavaPairRdd, allow user to specify ascending and numPartitions for sortByKey
[ https://issues.apache.org/jira/browse/SPARK-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2740. --- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Rui Li In JavaPairRdd, allow user to specify ascending and numPartitions for sortByKey --- Key: SPARK-2740 URL: https://issues.apache.org/jira/browse/SPARK-2740 Project: Spark Issue Type: Improvement Reporter: Rui Li Assignee: Rui Li Priority: Minor Fix For: 1.1.0 It should be more convenient if user can specify ascending and numPartitions when calling sortByKey. -- This message was sent by Atlassian JIRA (v6.2#6252)
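The Scala pair-RDD API already exposes both parameters, so the change mirrors that in JavaPairRDD. A small illustration using the Scala counterpart, assuming an existing SparkContext sc:
{code}
// Assumes an existing SparkContext `sc`, as in the other snippets in these issues.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val sorted = pairs.sortByKey(ascending = false, numPartitions = 2)
sorted.collect()  // Array((c,3), (b,2), (a,1))
{code}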
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081598#comment-14081598 ] Carlos Fuertes commented on SPARK-2017: --- I have done some tests with the solution where you use JSON to send the data. If you run with 50k tasks sc.parallelize(1 to 100, 5).count() the JSON [/stages/stage/tasks/json/?id=0] that represents the tasks table takes ~15 MB if you download it. You can get the JSON in a few seconds, but the UI [/stages/stage/?id=0] will still take forever to render it (the summary still shows up nonetheless). I did not change the way we are rendering, that is, move to pagination or anything else, and am still using sorttable to allow sorting of the table. Maybe just converting to JSON is too simple, and you still have to stream the data if you want to go to around 50k tasks and higher while maintaining responsiveness of the browser. And/or incorporate pagination directly, with a global index for the tasks on the backend. web ui stage page becomes unresponsive when the number of tasks is large Key: SPARK-2017 URL: https://issues.apache.org/jira/browse/SPARK-2017 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Reynold Xin Labels: starter {code} sc.parallelize(1 to 100, 100).count() {code} The above code creates one million tasks to be executed. The stage detail web ui page takes forever to load (if it ever completes). There are again a few different alternatives: 0. Limit the number of tasks we show. 1. Pagination 2. By default only show the aggregate metrics and failed tasks, and hide the successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081605#comment-14081605 ] Carlos Fuertes commented on SPARK-2017: --- I did not realize that the tasks all have their own index already so implementing the pagination on top of it should be simple. I'll give it a try. web ui stage page becomes unresponsive when the number of tasks is large Key: SPARK-2017 URL: https://issues.apache.org/jira/browse/SPARK-2017 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Reynold Xin Labels: starter {code} sc.parallelize(1 to 100, 100).count() {code} The above code creates one million tasks to be executed. The stage detail web ui page takes forever to load (if it ever completes). There are again a few different alternatives: 0. Limit the number of tasks we show. 1. Pagination 2. By default only show the aggregate metrics and failed tasks, and hide the successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081616#comment-14081616 ] Josh Rosen edited comment on SPARK-2282 at 7/31/14 10:37 PM: - Merged the improved fix from https://github.com/apache/spark/pull/1503 into 1.1. was (Author: joshrosen): Merged the improved fix from https://github.com/apache/spark/pull/1503 PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2282: -- Fix Version/s: 1.1.0 PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081616#comment-14081616 ] Josh Rosen commented on SPARK-2282: --- Merged the improved fix from https://github.com/apache/spark/pull/1503 PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
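The in-Spark alternative mentioned in the SPARK-2282 description amounts to enabling SO_REUSEADDR on the accumulator sockets. A generic java.net illustration, not the actual PythonAccumulatorParam code; host and port are placeholders:
{code}
import java.net.{InetSocketAddress, Socket}

// SO_REUSEADDR must be set before the socket binds (which connect() does implicitly).
val socket = new Socket()
socket.setReuseAddress(true) // allow reuse of local ports stuck in TIME_WAIT
socket.connect(new InetSocketAddress("localhost", 12345)) // placeholder endpoint
{code}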
[jira] [Resolved] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line
[ https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved SPARK-2771. --- Resolution: Fixed GenerateMIMAIgnore fails scalastyle check due to long line -- Key: SPARK-2771 URL: https://issues.apache.org/jira/browse/SPARK-2771 Project: Spark Issue Type: Bug Reporter: Ted Yu Priority: Minor Attachments: spark-2771-v1.txt I got the following error building the master branch: {code} [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 --- error file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala message=File line length exceeds 100 characters line=118 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml Processed 3 file(s) {code} This is caused by the 3rd line below: {code} classSymbol.typeSignature.members.filterNot(x => x.fullName.startsWith("java") || x.fullName.startsWith("scala")) .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || isExperimental(x)).map(_.fullName) ++ {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081642#comment-14081642 ] Apache Spark commented on SPARK-1812: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1701 Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Since Scala 2.10/2.11 are source compatible, we should be able to cross build for both versions. From what I understand there are basically three things we need to figure out: 1. Have a two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't really well supported by Maven since published pom's aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081650#comment-14081650 ] Apache Spark commented on SPARK-1812: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1702 Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Since Scala 2.10/2.11 are source compatible, we should be able to cross build for both versions. From what I understand there are basically three things we need to figure out: 1. Have a two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't really well supported by Maven since published pom's aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081657#comment-14081657 ] Apache Spark commented on SPARK-1812: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1703 Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Since Scala 2.10/2.11 are source compatible, we should be able to cross build for both versions. From what I understand there are basically three things we need to figure out: 1. Have a two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't really well supported by Maven since published pom's aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-695) Exponential recursion in getPreferredLocations
[ https://issues.apache.org/jira/browse/SPARK-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081669#comment-14081669 ] Aaron Staple commented on SPARK-695: Progress has been made on a PR here: https://github.com/apache/spark/pull/1362 Exponential recursion in getPreferredLocations -- Key: SPARK-695 URL: https://issues.apache.org/jira/browse/SPARK-695 Project: Spark Issue Type: Bug Reporter: Matei Zaharia This was reported to happen in DAGScheduler for graphs with many paths from the root up, though I haven't yet found a good test case for it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081671#comment-14081671 ] Apache Spark commented on SPARK-1812: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1704 Support cross-building with Scala 2.11 -- Key: SPARK-1812 URL: https://issues.apache.org/jira/browse/SPARK-1812 Project: Spark Issue Type: New Feature Components: Build, Spark Core Reporter: Matei Zaharia Assignee: Prashant Sharma Since Scala 2.10/2.11 are source compatible, we should be able to cross build for both versions. From what I understand there are basically three things we need to figure out: 1. Have a two versions of our dependency graph, one that uses 2.11 dependencies and the other that uses 2.10 dependencies. 2. Figure out how to publish different poms for 2.10 and 2.11. I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't really well supported by Maven since published pom's aren't generated dynamically. But we can probably script around it to make it work. I've done some initial sanity checks with a simple build here: https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2779) asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map
Yin Huai created SPARK-2779: --- Summary: asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map Key: SPARK-2779 URL: https://issues.apache.org/jira/browse/SPARK-2779 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
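The motivation, roughly; whether this exact path is the one hit in Spark SQL is not stated in the issue. Values built at runtime may be mutable maps, and a cast to scala.collection.immutable.Map rejects them, while the broader scala.collection.Map accepts either kind:
{code}
import scala.collection.mutable

// Illustrative only; not the Spark SQL code path from the issue.
val m: Any = mutable.HashMap("a" -> 1)
m.asInstanceOf[scala.collection.Map[String, Int]]               // fine: covers mutable and immutable
// m.asInstanceOf[scala.collection.immutable.Map[String, Int]]  // would throw ClassCastException
{code}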
[jira] [Commented] (SPARK-2779) asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map
[ https://issues.apache.org/jira/browse/SPARK-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081732#comment-14081732 ] Apache Spark commented on SPARK-2779: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/1705 asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map Key: SPARK-2779 URL: https://issues.apache.org/jira/browse/SPARK-2779 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2780) Create a StreamingContext.setLocalProperty for setting local property of jobs launched by streaming
Tathagata Das created SPARK-2780: Summary: Create a StreamingContext.setLocalProperty for setting local property of jobs launched by streaming Key: SPARK-2780 URL: https://issues.apache.org/jira/browse/SPARK-2780 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Priority: Minor SparkContext.setLocalProperty makes all Spark jobs submitted using the current thread belong to the set pool. However, in Spark Streaming, all the jobs are actually launched in the background from a different thread. So this setting does not work. Currently, there is a workaround. If you are doing any kind of output operation on DStreams, like DStream.foreachRDD(), you can set the property inside it: dstream.foreachRDD(rdd => rdd.sparkContext.setLocalProperty(...)) -- This message was sent by Atlassian JIRA (v6.2#6252)
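Spelled out, the workaround looks roughly like this; the pool name and the socket source are placeholders, and spark.scheduler.pool is the usual property passed to SparkContext.setLocalProperty:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("pool-workaround").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

lines.foreachRDD { rdd =>
  // Set the property on the thread that actually submits the job,
  // i.e. inside the output operation, not on the driver's main thread.
  rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "streaming-pool")
  rdd.count()
}

ssc.start()
ssc.awaitTermination()
{code}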
[jira] [Created] (SPARK-2781) Analyzer should check resolution of LogicalPlans
Aaron Staple created SPARK-2781: --- Summary: Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2781. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Michael Armbrust We actually fixed this a long time ago: https://github.com/apache/spark/commit/b3e768e154bd7175db44c3ffc3d8f783f15ab776 Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Assignee: Michael Armbrust Fix For: 1.0.1, 1.1.0 Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081828#comment-14081828 ] Apache Spark commented on SPARK-2781: - User 'staple' has created a pull request for this issue: https://github.com/apache/spark/pull/1706 Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Assignee: Michael Armbrust Fix For: 1.0.1, 1.1.0 Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-2781: - I'm sorry... I thought this was stale and did not read it carefully. Reopening. Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Assignee: Michael Armbrust Fix For: 1.0.1, 1.1.0 Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081833#comment-14081833 ] Aaron Staple commented on SPARK-2781: - No problem, I think the current validation checks Expressions but there are some cases where LogicalPlans might not be resolved even though Expressions are resolved. Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Assignee: Michael Armbrust Fix For: 1.0.1, 1.1.0 Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2781: Target Version/s: 1.2.0 Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Aaron Staple Assignee: Michael Armbrust Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2781: Fix Version/s: (was: 1.0.1) (was: 1.1.0) Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Aaron Staple Assignee: Michael Armbrust Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2781: Affects Version/s: 1.1.0 1.0.1 Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Aaron Staple Assignee: Michael Armbrust Currently the Analyzer’s CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolve attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column: {noformat} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class T1(value:Seq[Int]) val t1 = sc.parallelize(Seq(T1(Seq(0,1 t1.registerAsTable(t1) sqlContext.sql(SELECT value FROM t1 UNION SELECT 2 FROM t1”).collect() {noformat} In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)