[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-31 Thread Carlos Fuertes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080539#comment-14080539
 ] 

Carlos Fuertes commented on SPARK-2017:
---

Hi, I have implemented under https://github.com/apache/spark/pull/1682 the 
solution where the data for the tables is served as JSON for tasks under 
'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the 
same underlying problem).

The main addition is exposing paths that serve the JSON data:

/stages/stage/tasks/json/?id=nnn
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn

and using JavaScript to build the tables from an AJAX request to those JSON endpoints.

This partially solves the responsiveness issue, since the data is served 
asynchronously from the page load. However, since the driver resends all the 
data on every refresh, with a very large number of tasks it takes longer and 
longer to send all the data as the tasks progress. But at least the Summary 
table loads much faster, with no need to wait for the full task table to 
complete.

A better solution would be to stream the data in chunks as they become ready, 
or to keep a cache of the previous results. I have not explored the latter yet, 
but the above could be a starting point to build on.


 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2766) ScalaReflectionSuite throws an IllegalArgumentException in JDK 6

2014-07-31 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2766:
---

Summary: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6  
(was: ScalaReflectionSuite throws an IllegalArgumentException in jdk6)

 ScalaReflectionSuite throws an IllegalArgumentException in JDK 6
 ---

 Key: SPARK-2766
 URL: https://issues.apache.org/jira/browse/SPARK-2766
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li
Assignee: Guoqiang Li





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-31 Thread Carlos Fuertes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080539#comment-14080539
 ] 

Carlos Fuertes edited comment on SPARK-2017 at 7/31/14 6:05 AM:


Hi, I have implemented under https://github.com/apache/spark/pull/1682 the 
solution where the data for the tables is served as JSON for tasks under 
'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the 
same underlying problem).

The main addition is exposing paths that serve the JSON data:

/stages/stage/tasks/json/?id=nnn
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn

and using JavaScript to build the tables from an AJAX request to those JSON endpoints.

This partially solves the responsiveness issue, since the data is served 
asynchronously from the page load. However, since the driver resends all the 
data on every refresh, with a very large number of tasks it takes longer and 
longer to send all the data as the tasks progress. But at least the Summary 
table loads much faster, with no need to wait for the full task table to 
complete.

A better solution would be to stream the data in chunks as they become ready, 
or to keep a cache of the previous results. I have not explored the latter yet, 
but the above could be a starting point to build on.

Let me know how this looks to you.


was (Author: carlosfuertes):
Hi, I have implemented under https://github.com/apache/spark/pull/1682 the 
solution where the data for the tables is served as JSON for tasks under 
'stages' and also 'storage' (this is issue SPARK-2016, which boils down to the 
same underlying problem).

The main addition is exposing paths that serve the JSON data:

/stages/stage/tasks/json/?id=nnn
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn

and using JavaScript to build the tables from an AJAX request to those JSON endpoints.

This partially solves the responsiveness issue, since the data is served 
asynchronously from the page load. However, since the driver resends all the 
data on every refresh, with a very large number of tasks it takes longer and 
longer to send all the data as the tasks progress. But at least the Summary 
table loads much faster, with no need to wait for the full task table to 
complete.

A better solution would be to stream the data in chunks as they become ready, 
or to keep a cache of the previous results. I have not explored the latter yet, 
but the above could be a starting point to build on.


 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large

2014-07-31 Thread Carlos Fuertes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080542#comment-14080542
 ] 

Carlos Fuertes commented on SPARK-2016:
---

I have created a pull request, https://github.com/apache/spark/pull/1682, that 
deals with this issue. The idea follows the discussion in issue SPARK-2017, where 
the data for the tables is served as JSON and later rendered with JavaScript.

See https://issues.apache.org/jira/browse/SPARK-2017 for all the discussion.

 rdd in-memory storage UI becomes unresponsive when the number of RDD 
 partitions is large
 

 Key: SPARK-2016
 URL: https://issues.apache.org/jira/browse/SPARK-2016
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
  Labels: starter

 Try running
 {code}
 sc.parallelize(1 to 1000000, 1000000).cache().count()
 {code}
 And open the storage UI for this RDD. It takes forever to load the page.
 When the number of partitions is very large, I think there are a few 
 alternatives:
 0. Only show the top 1000.
 1. Pagination
 2. Instead of grouping by RDD blocks, group by executors



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2022:
---

Target Version/s: 1.1.0

 Spark 1.0.0 is failing if mesos.coarse set to true
 --

 Key: SPARK-2022
 URL: https://issues.apache.org/jira/browse/SPARK-2022
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Marek Wiewiorka
Assignee: Tim Chen
Priority: Critical

 more stderr
 ---
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2
 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 
 201405220917-134217738-5050-27119-0
 Exception in thread "main" java.lang.NumberFormatException: For input string: 
 "sparkseq003.cloudapp.net"
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:492)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
 at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 more stdout
 ---
 Registered executor on sparkseq003.cloudapp.net
 Starting task 5
 Forked command at 61202
 sh -c '/home/mesos/spark-1.0.0/bin/spark-class 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 -Dspark.mesos.coarse=true 
 akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG
 rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 
 4'
 Command exited with status 1 (pid: 61202)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2633) support register spark listener to listener bus with Java API

2014-07-31 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-

Attachment: Spark listener enhancement for Hive on Spark job monitor and 
statistic.docx

I added a doc to collect requirements from the Hive on Spark side; it may look 
messy in the comments. We could keep on discussing based on this file.

 support register spark listener to listener bus with Java API
 -

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Currently users can only register a Spark listener with the Scala API; we should 
 add this feature to the Java API as well.
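
For reference, a minimal sketch of how a listener is registered from the Scala API today, assuming an existing SparkContext named {{sc}} (the exact set of callbacks varies by version):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Example listener: print the stage id of every finished task.
val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"Task finished in stage ${taskEnd.stageId}")
  }
}
sc.addSparkListener(listener)
{code}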



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2712) Add a small note that mvn package must happen before test

2014-07-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080579#comment-14080579
 ] 

Patrick Wendell commented on SPARK-2712:


I'm a bit confused about the doc request because we already have a section in 
the maven doc pertaining explicitly to this:

http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven

 Add a small note that mvn package must happen before test
 -

 Key: SPARK-2712
 URL: https://issues.apache.org/jira/browse/SPARK-2712
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 0.9.1, 1.0.0, 1.1.1
 Environment: all
Reporter: Stephen Boesch
Priority: Trivial
  Labels: documentation
 Fix For: 1.1.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Add to the building-with-maven.md:
 Requirement: build packages before running tests
 Tests must be run AFTER the package target has already been executed. The 
 following is an example of a correct (build, test) sequence:
 mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
 mvn -Pyarn -Phadoop-2.3 -Phive test
 BTW Reynold Xin requested this tiny doc improvement.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2712) Add a small note that mvn package must happen before test

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2712:
---

Assignee: Stephen Boesch

 Add a small note that mvn package must happen before test
 -

 Key: SPARK-2712
 URL: https://issues.apache.org/jira/browse/SPARK-2712
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 0.9.1, 1.0.0, 1.1.1
 Environment: all
Reporter: Stephen Boesch
Assignee: Stephen Boesch
Priority: Trivial
  Labels: documentation
 Fix For: 1.1.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Add to the building-with-maven.md:
 Requirement: build packages before running tests
 Tests must be run AFTER the package target has already been executed. The 
 following is an example of a correct (build, test) sequence:
 mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
 mvn -Pyarn -Phadoop-2.3 -Phive test
 BTW Reynold Xin requested this tiny doc improvement.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao commented on SPARK-2728:
---

I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found 
it's very different now.

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBounds error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Casting rddSample.length - 1 to Long should be enough for a fix?
 Jianshi
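
A minimal sketch of the suggested cast, reusing the hypothetical variable names from the snippet above (not the change that actually landed in RangePartitioner):

{code}
val bounds = new Array[K](partitions - 1)
for (i <- 0 until partitions - 1) {
  // Widen to Long before multiplying so the intermediate product cannot overflow,
  // then narrow back to Int for indexing into the sample.
  val index = ((rddSample.length - 1).toLong * (i + 1) / partitions).toInt
  bounds(i) = rddSample(index)
}
{code}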



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:23 AM:


I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found 
it's very different now. It still uses Int, but doesn't have the multiplication 
like before.

Can you test it?


was (Author: larryxiao):
I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found 
it's very different now.

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBounds error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Casting rddSample.length - 1 to Long should be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:24 AM:


I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found 
it's very different now. It still uses Int, but doesn't have the multiplication 
like before.

Can you test again? [~huangjs]


was (Author: larryxiao):
I checked out the newest commit 5a110da25f15694773d6f7c6ee63c5b08ada4eb0 and found 
it's very different now. It still uses Int, but doesn't have the multiplication 
like before.

Can you test it?

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBounds error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Casting rddSample.length - 1 to Long should be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2729) Forgot to match Timestamp type in ColumnBuilder

2014-07-31 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080624#comment-14080624
 ] 

Cheng Lian commented on SPARK-2729:
---

Steps to reproduce this bug within {{sbt -Phive hive/console}}:
{code}
scala> hql("create table dates as select cast('2011-01-01 01:01:01' as 
timestamp) from src")
...
scala> cacheTable("dates")
...
scala> hql("select count(*) from dates").collect()
...
14/07/31 16:00:37 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 
(TID 6)
scala.MatchError: 8 (of class java.lang.Integer)
at 
org.apache.spark.sql.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:146)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anonfun$2.apply(InMemoryColumnarTableScan.scala:48)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anonfun$2.apply(InMemoryColumnarTableScan.scala:47)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1.apply(InMemoryColumnarTableScan.scala:47)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1.apply(InMemoryColumnarTableScan.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:595)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:595)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}


 Forgot to match Timestamp type in ColumnBuilder
 ---

 Key: SPARK-2729
 URL: https://issues.apache.org/jira/browse/SPARK-2729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Teng Qiu

 After SPARK-2710 we can create a table in Spark SQL with ColumnType Timestamp 
 from JDBC.
 When I try to
 {code}
 sqlContext.cacheTable("myJdbcTable")
 {code}
 and then
 {code}
 sqlContext.sql("select count(*) from myJdbcTable")
 {code}
 I get an exception:
 {code}
 scala.MatchError: 8 (of class java.lang.Integer)
 at 
 org.apache.spark.sql.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:146)
 {code}
 I checked the code at ColumnBuilder.scala:146;
 it is just missing a match for the Timestamp type id,
 so it is easy to fix.
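
Purely as an illustration of the kind of one-line fix being described, with hypothetical names (the real ColumnBuilder code and its column type objects may be organized differently):

{code}
// Hypothetical shape of ColumnBuilder.apply's dispatch on the column type id.
// Without a TIMESTAMP arm, a type id of 8 falls through to a MatchError,
// which is consistent with the "scala.MatchError: 8" above.
val builder = typeId match {
  case INT.typeId       => new IntColumnBuilder
  case STRING.typeId    => new StringColumnBuilder
  case TIMESTAMP.typeId => new TimestampColumnBuilder   // the arm reported as missing
  // ... other column types ...
}
{code}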



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080627#comment-14080627
 ] 

Apache Spark commented on SPARK-1812:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1685

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't 
 really well supported by Maven since published pom's aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080647#comment-14080647
 ] 

Sean Owen commented on SPARK-2728:
--

Yes, this was likely fixed by recent changes to RangePartitioner already.

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the partition number are greater than 10362, then spark will report 
 ArrayOutofIndex error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i - 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Cast rddSample.length - 1 to Long should be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653
 ] 

Tathagata Das commented on SPARK-2447:
--

I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

 

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark
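
As a very rough sketch of the connection-per-partition, auto-flush-off idea above (a hypothetical helper written against the HBase 0.94/0.98-era HTable client API, not the PR under review):

{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical bulk-put helper: one HTable per partition, auto-flush disabled so
// puts are buffered client-side and flushed once at the end for higher throughput.
def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    val table = new HTable(HBaseConfiguration.create(), tableName)
    table.setAutoFlush(false)
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()
    table.close()
  }
}
{code}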



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653
 ] 

Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:39 AM:
---

I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

The relevant PR is this https://github.com/apache/spark/pull/194/files

 


was (Author: tdas):
I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

 

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653
 ] 

Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:40 AM:
---

I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

The relevant PR is this https://github.com/apache/spark/pull/194/files

 [~ted.m]


was (Author: tdas):
I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

The relevant PR is this https://github.com/apache/spark/pull/194/files

 

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080653#comment-14080653
 ] 

Tathagata Das edited comment on SPARK-2447 at 7/31/14 8:41 AM:
---

I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

The relevant PR is this https://github.com/apache/spark/pull/194/files

 [~ted.m] can you take a look at this PR as well? I think the saveAsHBaseFile 
is a simpler interface that may be worth supporting if there is enough use of 
this simple interface (which assumes that all rows have the same column structure). 


was (Author: tdas):
I took a brief look at SPARK-1127 as well. I think both the PRs have their 
merits. We should consider consolidating the functionalities that they provide.

The relevant PR is this https://github.com/apache/spark/pull/194/files

 [~ted.m]

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication

2014-07-31 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080657#comment-14080657
 ] 

Raymond Liu commented on SPARK-2468:


so, is there anyone working on this?

 zero-copy shuffle network communication
 ---

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user space buffer, and then back to a kernel send buffer before it reaches 
 the NIC. It does multiple copies of the data and context switching between 
 kernel and user space. It also creates unnecessary buffers in the JVM that 
 increase GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in the 
 kernel space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty NIO.
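
For illustration only, a minimal standalone sketch of the FileChannel.transferTo primitive mentioned above (plain NIO; this is not Spark's shuffle code):

{code}
import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Stream a file's bytes straight to a socket; transferTo lets the kernel move
// the data without copying it through a user-space buffer.
def sendFileZeroCopy(path: String, host: String, port: Int): Unit = {
  val fileChannel = new FileInputStream(path).getChannel
  val socket = SocketChannel.open(new InetSocketAddress(host, port))
  try {
    var position = 0L
    val size = fileChannel.size()
    while (position < size) {
      position += fileChannel.transferTo(position, size - position, socket)
    }
  } finally {
    socket.close()
    fileChannel.close()
  }
}
{code}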



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2767) SparkSQL CLI doesn't output error message if query failed.

2014-07-31 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2767:


 Summary: SparkSQL CLI doesn't output error message if query failed.
 Key: SPARK-2767
 URL: https://issues.apache.org/jira/browse/SPARK-2767
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2767) SparkSQL CLI doesn't output error message if query failed.

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080660#comment-14080660
 ] 

Apache Spark commented on SPARK-2767:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/1686

 SparkSQL CLI doesn't output error message if query failed.
 --

 Key: SPARK-2767
 URL: https://issues.apache.org/jira/browse/SPARK-2767
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2014-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080661#comment-14080661
 ] 

Sean Owen commented on SPARK-2749:
--

Thanks! PR merged so this can be closed as fixed.

 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
 junit:junit dep
 ---

 Key: SPARK-2749
 URL: https://issues.apache.org/jira/browse/SPARK-2749
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Priority: Minor

 The Maven-based builds in the build matrix have been failing for a few days:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/
 On inspection, it looks like the Spark SQL Java tests don't compile:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
 I confirmed it by repeating the command vs master:
 mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
 The problem is that this module doesn't depend on JUnit. In fact, none of the 
 modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
 in, in most places. However this module doesn't depend on 
 com.novocode:junit-interface
 Adding the junit:junit dependency fixes the compile problem. In fact, the 
 other modules with Java tests should probably depend on it explicitly instead 
 of happening to get it via com.novocode:junit-interface, since that is a bit 
 SBT/Scala-specific (and I am not even sure it's needed).
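
The dependency addition described above would look roughly like this in the affected module's pom.xml (the version shown is only illustrative):

{code}
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.10</version>
  <scope>test</scope>
</dependency>
{code}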



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel

2014-07-31 Thread Sean Owen (JIRA)
Sean Owen created SPARK-2768:


 Summary: Add product, user recommend method to 
MatrixFactorizationModel
 Key: SPARK-2768
 URL: https://issues.apache.org/jira/browse/SPARK-2768
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Priority: Minor


Right now, MatrixFactorizationModel can only predict a score for one or more 
(user,product) tuples. As a comment in the file notes, it would be more useful 
to expose a recommend method that computes the top N scoring products for a user 
(or vice versa -- users for a product).
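
One possible shape for such a method, sketched against the model's public userFeatures/productFeatures RDDs (a hypothetical helper, not the actual pull request):

{code}
import org.apache.spark.SparkContext._   // PairRDD lookup on older Spark versions
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Score every product for one user as the dot product of the latent factor
// vectors, then keep the N highest-scoring (product, score) pairs.
def recommendProducts(model: MatrixFactorizationModel, user: Int, n: Int): Array[(Int, Double)] = {
  val userVec = model.userFeatures.lookup(user).head
  model.productFeatures
    .map { case (product, vec) =>
      (product, vec.zip(userVec).map { case (a, b) => a * b }.sum)
    }
    .top(n)(Ordering.by[(Int, Double), Double](_._2))
}
{code}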



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080715#comment-14080715
 ] 

Apache Spark commented on SPARK-2768:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1687

 Add product, user recommend method to MatrixFactorizationModel
 --

 Key: SPARK-2768
 URL: https://issues.apache.org/jira/browse/SPARK-2768
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Priority: Minor

 Right now, MatrixFactorizationModel can only predict a score for one or more 
 (user,product) tuples. As a comment in the file notes, it would be more 
 useful to expose a recommend method that computes the top N scoring products for 
 a user (or vice versa -- users for a product).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080729#comment-14080729
 ] 

Jianshi Huang commented on SPARK-2728:
--

I see. Thanks for the fix Sean and Larry!

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the number of partitions is greater than 10362, then Spark will report an 
 ArrayIndexOutOfBounds error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i <- 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Casting rddSample.length - 1 to Long should be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080741#comment-14080741
 ] 

Jianshi Huang commented on SPARK-2728:
--

Can anyone test it? Then I'll close the issue. My build for HDP 2.1 couldn't work 
with YARN... strange.

 Integer overflow in partition index calculation RangePartitioner
 

 Key: SPARK-2728
 URL: https://issues.apache.org/jira/browse/SPARK-2728
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Spark 1.0.1
Reporter: Jianshi Huang
  Labels: easyfix

 If the partition number are greater than 10362, then spark will report 
 ArrayOutofIndex error. 
 The reason is in the partition index calculation in rangeBounds:
 #Line: 112
 val bounds = new Array[K](partitions - 1)
 for (i - 0 until partitions - 1) {
   val index = (rddSample.length - 1) * (i + 1) / partitions
   bounds(i) = rddSample(index)
 }
 Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
 Cast rddSample.length - 1 to Long should be enough for a fix?
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)
Stephen Walsh created SPARK-2769:


 Summary: Ganglia Support Broken / Not working
 Key: SPARK-2769
 URL: https://issues.apache.org/jira/browse/SPARK-2769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Linux Red Hat 6.4 on Spark 1.1.0
Reporter: Stephen Walsh


Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that it fails when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

Regards
Steve
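
One way to narrow this down independently of Spark would be a tiny standalone check against the same endpoint, using the metrics-graphite client directly and calling connect() explicitly before send(); the host, port, and metric name below are placeholders:

{code}
import java.net.InetSocketAddress
import com.codahale.metrics.graphite.Graphite

// Minimal smoke test for the Graphite/Ganglia endpoint configured in metrics.properties.
object GraphiteSmokeTest {
  def main(args: Array[String]): Unit = {
    val graphite = new Graphite(new InetSocketAddress("HOSTNAME", 8649))
    graphite.connect()                                     // open the socket explicitly
    graphite.send("aa.smoke.test", "1", System.currentTimeMillis() / 1000)
    graphite.close()
  }
}
{code}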




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Walsh updated SPARK-2769:
-

Description: 
Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that it fails when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

Regards
Steve


  was:
Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 

[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Walsh updated SPARK-2769:
-

Description: 
Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that it fails when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

as it seems to fail in the send function:
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77

and with this.writer not initialized, the writer.write call will fail.

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

Regards
Steve


  was:
Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at 

[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Walsh updated SPARK-2769:
-

Description: 
Hi all,
I've built Spark 1.1.0 with sbt with Ganglia enabled and Hadoop version 2.4.0.

No issues there; Spark works fine on Hadoop 2.4.0, and Ganglia (GraphiteSink) is 
installed.

I've added the following to metrics.properties:


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)
https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that it fails when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

as it seems to fail in the send function:
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77

and with this.writer not initialized, the writer.write call will fail.

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

Regards
Steve


  was:
Hi all,
I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0

No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is 
installed.

I've added the following to the metrics.properties


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
   

[jira] [Commented] (SPARK-2709) Add a tool for certifying Spark API compatibility

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080910#comment-14080910
 ] 

Apache Spark commented on SPARK-2709:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/1688

 Add a tool for certifying Spark API compatibility
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies API compatibility of a provided Spark package. The tool 
 would certify that a vendor distribution of Spark contains all of the API's 
 present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or back ports to Spark. It will also discourage vendors from 
 knowingly breaking API's, because anyone can audit their distribution and see 
 that they have removed support for certain API's.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public API's (we might 
 borrow the MIMA string representation of byte code signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark API's and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080935#comment-14080935
 ] 

Apache Spark commented on SPARK-1021:
-

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/1689

 sortByKey() launches a cluster job when it shouldn't
 

 Key: SPARK-1021
 URL: https://issues.apache.org/jira/browse/SPARK-1021
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.0, 0.9.0
Reporter: Andrew Ash
Assignee: Mark Hamstra
  Labels: starter

 The sortByKey() method is listed as a transformation, not an action, in the 
 documentation.  But it launches a cluster job regardless.
 http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
 Some discussion on the mailing list suggested that this is a problem with the 
 rdd.count() call inside Partitioner.scala's rangeBounds method.
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
 Josh Rosen suggests that rangeBounds should be made into a lazy variable:
 {quote}
 I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
 fix this 
 (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
   We'd need to make sure that rangeBounds() is never called before an action 
 is performed.  This could be tricky because it's called in the 
 RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
 number of partitions, the ids of the RDDs used to create the 
 RangePartitioner, and the sort ordering.  This still supports the case where 
 I range-partition one RDD and pass the same partitioner to a different RDD.  
 It breaks support for the case where two range partitioners created on 
 different RDDs happened to have the same rangeBounds(), but it seems unlikely 
 that this would really harm performance since it's probably unlikely that the 
 range partitioners are equal by chance.
 {quote}
 Can we please make this happen?  I'll send a PR on GitHub to start the 
 discussion and testing.
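 A tiny standalone sketch of the lazy-val idea from the quote above (class and field names here are illustrative, not Spark's actual RangePartitioner):
 {code}
 // Illustrative only: rangeBounds is computed lazily, so constructing the
 // partitioner no longer launches a job, and equals() compares cheaper fields
 // instead of forcing the bounds.
 class LazyRangePartitioner[K](
     val partitions: Int,
     val rddId: Int,
     val ascending: Boolean,
     sample: () => Array[K]) extends Serializable {

   // Runs the sampling job only on first access, e.g. from inside an action.
   lazy val rangeBounds: Array[K] = sample()

   override def equals(other: Any): Boolean = other match {
     case o: LazyRangePartitioner[_] =>
       o.partitions == partitions && o.rddId == rddId && o.ascending == ascending
     case _ => false
   }

   override def hashCode(): Int = (partitions, rddId, ascending).hashCode()
 }
 {code}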



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080953#comment-14080953
 ] 

Apache Spark commented on SPARK-2749:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1690

 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
 junit:junit dep
 ---

 Key: SPARK-2749
 URL: https://issues.apache.org/jira/browse/SPARK-2749
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Priority: Minor

 The Maven-based builds in the build matrix have been failing for a few days:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/
 On inspection, it looks like the Spark SQL Java tests don't compile:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
 I confirmed it by repeating the command vs master:
 mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
 The problem is that this module doesn't depend on JUnit. In fact, none of the 
 modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
 in, in most places. However this module doesn't depend on 
 com.novocode:junit-interface
 Adding the junit:junit dependency fixes the compile problem. In fact, the 
 other modules with Java tests should probably depend on it explicitly instead 
 of happening to get it via com.novocode:junit-interface, since that is a bit 
 SBT/Scala-specific (and I am not even sure it's needed).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Walsh updated SPARK-2769:
-

Description: 
Hi all,
I've built Spark 1.1.0 with sbt, with Ganglia enabled and Hadoop version 2.4.0

No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is 
installed.

I've added the following to the metrics.properties


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)
https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that the error occurs when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

as it seems to fail in the send function:
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77

and with this.writer not initialized, the writer.write call will fail.

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

EDIT:
found out where the connect gets called.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153

and this is called from here

https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98

which is called from here

https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98


but the issue still stands. :/

Regards
Steve


  was:
Hi all,
I've build spark 1.1.0 with sbt with ganglia enabled and hadoop version 2.4.0

No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is 
installed.

I've added the following to the metrics.properties


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN 

[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080969#comment-14080969
 ] 

Apache Spark commented on SPARK-2700:
-

User 'chutium' has created a pull request for this issue:
https://github.com/apache/spark/pull/1691

 Hidden files (such as .impala_insert_staging) should be filtered out by 
 sqlContext.parquetFile
 --

 Key: SPARK-2700
 URL: https://issues.apache.org/jira/browse/SPARK-2700
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.0.1
Reporter: Teng Qiu

 when creating a table in Impala, a hidden folder .impala_insert_staging will 
 be created in the table's folder.
 if we want to load such a table using the Spark SQL API sqlContext.parquetFile, 
 this hidden folder causes trouble: Spark tries to get metadata from this folder, 
 and you will see the exception:
 {code:borderStyle=solid}
 Caused by: java.io.IOException: Could not read footer for file 
 FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging;
  isDirectory=true; modification_time=1406333729252; access_time=0; 
 owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false}
 ...
 ...
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is 
 not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging
 {code}
 and the Impala side does not think this is their problem: 
 https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete 
 .impala_insert_staging directory after INSERT)
 so maybe we should filter out these hidden folders/files when reading Parquet 
 tables
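 A possible shape of such a filter, sketched with plain Hadoop APIs (the directory is the one from the stack trace above; whether Parquet's own _metadata files need special handling is left open):
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

 // Skip hidden entries (dot- or underscore-prefixed) when listing a table directory.
 object VisibleFilesFilter extends PathFilter {
   override def accept(path: Path): Boolean = {
     val name = path.getName
     !name.startsWith(".") && !name.startsWith("_")
   }
 }

 object ListVisibleFiles {
   def main(args: Array[String]): Unit = {
     val fs = FileSystem.get(new Configuration())
     val dir = new Path("/user/hive/warehouse/parquet_strings")
     fs.listStatus(dir, VisibleFilesFilter).foreach(status => println(status.getPath))
   }
 }
 {code}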



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2381) streaming receiver crashed,but seems nothing happened

2014-07-31 Thread sunsc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081021#comment-14081021
 ] 

sunsc commented on SPARK-2381:
--

Sent a PR here: https://github.com/apache/spark/pull/1693/

 streaming receiver crashed,but seems nothing happened
 -

 Key: SPARK-2381
 URL: https://issues.apache.org/jira/browse/SPARK-2381
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: sunsc

 when we submit a streaming job, if the receivers don't start normally, the 
 application should stop itself. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2379) stopReceive in dead loop, cause stackoverflow exception

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081033#comment-14081033
 ] 

Apache Spark commented on SPARK-2379:
-

User 'joyyoj' has created a pull request for this issue:
https://github.com/apache/spark/pull/1694

 stopReceive in dead loop, cause stackoverflow exception
 ---

 Key: SPARK-2379
 URL: https://issues.apache.org/jira/browse/SPARK-2379
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: sunsc

 In streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala,
 stop will call stopReceiver, and stopReceiver will call stop if an exception 
 occurs, which makes a dead loop.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2769) Ganglia Support Broken / Not working

2014-07-31 Thread Stephen Walsh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Walsh updated SPARK-2769:
-

Description: 
Hi all,
I've built Spark 1.1.0 with sbt, with Ganglia enabled and Hadoop version 2.4.0

No issues there, spark works fine on hadoop 2.4.0 and ganglia (GraphiteSink) is 
installed.

I've added the following to the metrics.properties


*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=HOSTNAME
*.sink.graphite.port=8649
*.sink.graphite.period=1
*.sink.graphite.prefix=aa




and I get this error message

14/07/31 05:39:00 WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at com.codahale.metrics.graphite.Graphite.send(Graphite.java:77)
at 
com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:254)
at 
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:156)
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:107)
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:86)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




From looking at the code I see the following.


  val graphite: Graphite = new Graphite(new InetSocketAddress(host, port))

  val reporter: GraphiteReporter = GraphiteReporter.forRegistry(registry)
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .convertRatesTo(TimeUnit.SECONDS)
  .prefixedWith(prefix)
  .build(graphite)
https://github.com/apache/spark/blob/87bd1f9ef7d547ee54a8a83214b45462e0751efb/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L69


Followed by

override def start() {
reporter.start(pollPeriod, pollUnit)
  }


I noticed that the error occurs when we first try to send a message, but nowhere 
do I see graphite.connect() being called?

https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L62

as it seems to fail in the send function:
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L77

and with this.writer not initialized, the writer.write call will fail.

The GraphiteReporter builder doesn't call it either when creating the reporter object.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L113


Maybe I'm looking in the wrong area and I'm passing in the wrong values - but 
very little logging has me thinking it is a bug.

EDIT:
found out where the connect gets called.
https://github.com/dropwizard/metrics/blob/master/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteReporter.java#L153

and this is called from here

https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98

which is called from here

https://github.com/dropwizard/metrics/blob/99dc540c2cbe6bb3be304e20449fb641c7f5382a/metrics-core/src/main/java/com/codahale/metrics/ScheduledReporter.java#L98


but the issue still stands. :/



Edit 2:

my ports are open and listening

[root@rtr-dev-spark4 ~]# lsof -i :8649
COMMAND   PIDUSER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
gmond   32173 ganglia5u  IPv4 3480253  0t0  UDP 
rtr-dev-spark4.ord2012:8649
gmond   32173 ganglia6u  IPv4 3480255  0t0  TCP 
rtr-dev-spark4.ord2012:8649 (LISTEN)
gmond   32173 ganglia7u  IPv4 3480257  0t0  UDP 
rtr-dev-spark4.ord2012:55523-rtr-dev-spark4.ord2012:8649
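
For anyone trying to reproduce this outside Spark, here is a minimal standalone sketch that drives the same Dropwizard metrics builder chain quoted above (host, port and prefix are placeholders copied from the config in this report):

{code}
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit

import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object GraphiteSmokeTest {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry()
    registry.counter("test.counter").inc()

    // Same builder chain as GraphiteSink; the reporter is expected to open the
    // connection itself on each report cycle before writing.
    val graphite = new Graphite(new InetSocketAddress("HOSTNAME", 8649))
    val reporter = GraphiteReporter.forRegistry(registry)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .convertRatesTo(TimeUnit.SECONDS)
      .prefixedWith("aa")
      .build(graphite)

    reporter.start(1, TimeUnit.SECONDS)
    Thread.sleep(5000)
    reporter.stop()
  }
}
{code}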


Regards
Steve


  was:
Hi all,
I've build spark 1.1.0 

[jira] [Created] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl

2014-07-31 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-2770:
---

 Summary: Rename spark-ganglia-lgpl to ganglia-lgpl
 Key: SPARK-2770
 URL: https://issues.apache.org/jira/browse/SPARK-2770
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Chris Fregly
Priority: Minor
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line

2014-07-31 Thread Ted Yu (JIRA)
Ted Yu created SPARK-2771:
-

 Summary: GenerateMIMAIgnore fails scalastyle check due to long line
 Key: SPARK-2771
 URL: https://issues.apache.org/jira/browse/SPARK-2771
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


I got the following error building master branch:
{code}
[INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 ---
error 
file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala
 message=File line length exceeds 100 characters line=118
Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml
Processed 3 file(s)
{code}
This is caused by 3rd line below:
{code}
classSymbol.typeSignature.members.filterNot(x =>
  x.fullName.startsWith("java") || x.fullName.startsWith("scala"))
.filter(x => isPackagePrivate(x) || isDeveloperApi(x) || 
isExperimental(x)).map(_.fullName) ++
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line

2014-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081050#comment-14081050
 ] 

Sean Owen commented on SPARK-2771:
--

Already on it :)
https://github.com/apache/spark/pull/1690

 GenerateMIMAIgnore fails scalastyle check due to long line
 --

 Key: SPARK-2771
 URL: https://issues.apache.org/jira/browse/SPARK-2771
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor

 I got the following error building master branch:
 {code}
 [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 
 ---
 error 
 file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala
  message=File line length exceeds 100 characters line=118
 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml
 Processed 3 file(s)
 {code}
 This is caused by 3rd line below:
 {code}
 classSymbol.typeSignature.members.filterNot(x =>
   x.fullName.startsWith("java") || x.fullName.startsWith("scala"))
 .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || 
 isExperimental(x)).map(_.fullName) ++
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line

2014-07-31 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-2771:
--

Attachment: spark-2771-v1.txt

Patch v1 shortens line 118 to 100 chars wide.

 GenerateMIMAIgnore fails scalastyle check due to long line
 --

 Key: SPARK-2771
 URL: https://issues.apache.org/jira/browse/SPARK-2771
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-2771-v1.txt


 I got the following error building master branch:
 {code}
 [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 
 ---
 error 
 file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala
  message=File line length exceeds 100 characters line=118
 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml
 Processed 3 file(s)
 {code}
 This is caused by 3rd line below:
 {code}
 classSymbol.typeSignature.members.filterNot(x =>
   x.fullName.startsWith("java") || x.fullName.startsWith("scala"))
 .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || 
 isExperimental(x)).map(_.fullName) ++
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2772) Spark Project SQL fails compilation

2014-07-31 Thread Ted Yu (JIRA)
Ted Yu created SPARK-2772:
-

 Summary: Spark Project SQL fails compilation
 Key: SPARK-2772
 URL: https://issues.apache.org/jira/browse/SPARK-2772
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu


I used the following command:
{code}
mvn clean -Pyarn -Phive -Phadoop-2.4 -DskipTests package
{code}
I got:
{code}
[ERROR] 
/homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:52:
 wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 
4
[ERROR] val shuffled = new ShuffledRDD[Row, Row, Row](rdd, part)
[ERROR]^
[ERROR] 
/homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:65:
 wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 
4
[ERROR] val shuffled = new ShuffledRDD[Row, Null, Null](rdd, part)
[ERROR]^
[ERROR] 
/homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:76:
 wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 
4
[ERROR] val shuffled = new ShuffledRDD[Null, Row, Row](rdd, partitioner)
[ERROR]^
[ERROR] 
/homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala:151:
 wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should be 
4
[ERROR] val shuffled = new ShuffledRDD[Boolean, Row, Row](rdd, part)
[ERROR]^
[ERROR] four errors found
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations

2014-07-31 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2762:
-

Assignee: Timothy Hunter

 SparkILoop leaks memory in multi-repl configurations
 

 Key: SPARK-2762
 URL: https://issues.apache.org/jira/browse/SPARK-2762
 Project: Spark
  Issue Type: Bug
Reporter: Timothy Hunter
Assignee: Timothy Hunter
Priority: Minor

 When subclassing SparkILoop and instantiating multiple objects, the 
 SparkILoop instances do not get garbage collected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations

2014-07-31 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2762.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 SparkILoop leaks memory in multi-repl configurations
 

 Key: SPARK-2762
 URL: https://issues.apache.org/jira/browse/SPARK-2762
 Project: Spark
  Issue Type: Bug
Reporter: Timothy Hunter
Assignee: Timothy Hunter
Priority: Minor
 Fix For: 1.1.0


 When subclassing SparkILoop and instantiating multiple objects, the 
 SparkILoop instances do not get garbage collected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations

2014-07-31 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081154#comment-14081154
 ] 

Matei Zaharia commented on SPARK-2762:
--

PR: https://github.com/apache/spark/pull/1674

 SparkILoop leaks memory in multi-repl configurations
 

 Key: SPARK-2762
 URL: https://issues.apache.org/jira/browse/SPARK-2762
 Project: Spark
  Issue Type: Bug
Reporter: Timothy Hunter
Assignee: Timothy Hunter
Priority: Minor
 Fix For: 1.1.0


 When subclassing SparkILoop and instantiating multiple objects, the 
 SparkILoop instances do not get garbage collected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-07-31 Thread uncleGen (JIRA)
uncleGen created SPARK-2773:
---

 Summary: Shuffle:use growth rate to predict if need to spill
 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.0.0, 0.9.0
Reporter: uncleGen
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081163#comment-14081163
 ] 

Apache Spark commented on SPARK-2773:
-

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1696

 Shuffle:use growth rate to predict if need to spill
 ---

 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 0.9.0, 1.0.0
Reporter: uncleGen
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-07-31 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081161#comment-14081161
 ] 

uncleGen commented on SPARK-2773:
-

here is my improvement: https://github.com/apache/spark/pull/1696

 Shuffle:use growth rate to predict if need to spill
 ---

 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 0.9.0, 1.0.0
Reporter: uncleGen
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2744) The configuration spark.history.retainedApplications is invalid

2014-07-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081191#comment-14081191
 ] 

Marcelo Vanzin commented on SPARK-2744:
---

Sort of. The docs say:

{quote}
The number of application UIs to retain.
{quote}

That's still true. Only the configured number of UIs are kept in memory. The 
newer code will list all available applications, though, while only keeping a few 
UIs in memory. So it's more useful, since you're able to browse more history.

 The configuration spark.history.retainedApplications is invalid
 -

 Key: SPARK-2744
 URL: https://issues.apache.org/jira/browse/SPARK-2744
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
  Labels: historyserver

 when I set it in spark-env.sh like this: export 
 SPARK_HISTORY_OPTS=$SPARK_HISTORY_OPTS -Dspark.history.ui.port=5678 
 -Dspark.history.retainedApplications=1, the history server web UI retains 
 more than one application



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2743) Parquet has issues with capital letters and case insensitivity

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2743.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Parquet has issues with capital letters and case insensitivity
 --

 Key: SPARK-2743
 URL: https://issues.apache.org/jira/browse/SPARK-2743
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2397) Get rid of LocalHiveContext

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2397.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Get rid of LocalHiveContext
 ---

 Key: SPARK-2397
 URL: https://issues.apache.org/jira/browse/SPARK-2397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.1.0


 HiveLocalContext is nearly completely redundant with HiveContext.  We should 
 consider deprecating it and removing all uses.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2774) Set preferred locations for reduce tasks

2014-07-31 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2774:


 Summary: Set preferred locations for reduce tasks
 Key: SPARK-2774
 URL: https://issues.apache.org/jira/browse/SPARK-2774
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman


Currently we do not set preferred locations for reduce tasks in Spark. This 
patch proposes setting preferred locations based on the map output sizes and 
locations tracked by the MapOutputTracker. This is useful in two situations (see the sketch after the list):

1. When you have a small job in a large cluster it can be useful to co-locate 
map and reduce tasks to avoid going over the network
2. If there is a lot of data skew in the map stage outputs, then it is 
beneficial to place the reducer close to the largest output.
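
A rough sketch of the ranking idea (names are illustrative only; the actual patch works against the MapOutputTracker internals):

{code}
// Rank executor hosts by how many map-output bytes they hold for one reduce
// partition, and prefer the top few as that task's locations.
case class MapOutputInfo(host: String, bytesForReduce: Long)

def preferredLocations(outputs: Seq[MapOutputInfo], topK: Int = 2): Seq[String] = {
  outputs
    .groupBy(_.host)
    .map { case (host, infos) => (host, infos.map(_.bytesForReduce).sum) }
    .toSeq
    .sortBy { case (_, bytes) => -bytes }
    .take(topK)
    .map { case (host, _) => host }
}
{code}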



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2028) Let users of HadoopRDD access the partition InputSplits

2014-07-31 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2028.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 Let users of HadoopRDD access the partition InputSplits
 ---

 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 1.1.0


 If a user creates a HadoopRDD (e.g., via textFile), there is no way to find 
 out which file it came from, though this information is contained in the 
 InputSplit within the RDD. We should find a way to expose this publicly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2774) Set preferred locations for reduce tasks

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081256#comment-14081256
 ] 

Apache Spark commented on SPARK-2774:
-

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/1697

 Set preferred locations for reduce tasks
 

 Key: SPARK-2774
 URL: https://issues.apache.org/jira/browse/SPARK-2774
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman

 Currently we do not set preferred locations for reduce tasks in Spark. This 
 patch proposes setting preferred locations based on the map output sizes and 
 locations tracked by the MapOutputTracker. This is useful in two situations:
 1. When you have a small job in a large cluster it can be useful to co-locate 
 map and reduce tasks to avoid going over the network
 2. If there is a lot of data skew in the map stage outputs, then it is 
 beneficial to place the reducer close to the largest output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2775) HiveContext does not support dots in column names.

2014-07-31 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2775:
---

 Summary: HiveContext does not support dots in column names. 
 Key: SPARK-2775
 URL: https://issues.apache.org/jira/browse/SPARK-2775
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


When you try the following snippet in hive/console. 
{code}
val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": "value2"}"""))
jsonRDD(data).registerAsTable("jt")
hql("select `key.number1` from jt")
{code}
You will find that the name key.number1 cannot be resolved.
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: 'key.number1, tree:
Project ['key.number1]
 LowerCaseSchema 
  Subquery jt
   SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] 
at map at JsonRDD.scala:37)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2664.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1665
[https://github.com/apache/spark/pull/1665]

 Deal with `--conf` options in spark-submit that relate to flags
 ---

 Key: SPARK-2664
 URL: https://issues.apache.org/jira/browse/SPARK-2664
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker
 Fix For: 1.1.0


 If someone sets a spark conf that relates to an existing flag `--master`, we 
 should set it correctly like we do with the defaults file. Otherwise it can 
 have confusing semantics. I noticed this after merging it, otherwise I would 
 have mentioned it in the review.
 I think it's as simple as modifying loadDefaults to check the user-supplied 
 options also. We might change it to loadUserProperties since it's no longer 
 just the defaults file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2776) Add normalizeByCol method to mllib.util.MLUtils

2014-07-31 Thread Andres Perez (JIRA)
Andres Perez created SPARK-2776:
---

 Summary: Add normalizeByCol method to mllib.util.MLUtils
 Key: SPARK-2776
 URL: https://issues.apache.org/jira/browse/SPARK-2776
 Project: Spark
  Issue Type: New Feature
Reporter: Andres Perez
Priority: Minor


Add the ability to compute the mean and standard deviations of each vector 
(LabeledPoint) component and normalize each vector in the RDD, using only RDD 
transformations. The result is an RDD of Vectors where each column has a mean 
of zero and standard deviation of one.

See https://github.com/apache/spark/pull/1698
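
A minimal sketch of the computation (function and variable names are illustrative, not necessarily the PR's API): it collects per-column sums and squared sums with two actions (count and reduce), then rescales every vector with a final map.

{code}
import org.apache.spark.rdd.RDD

// Zero-mean / unit-variance normalization per column, for dense rows of equal length.
def normalizeByCol(data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val n = data.count().toDouble
  val (sums, squareSums) = data
    .map(v => (v, v.map(x => x * x)))
    .reduce { case ((s1, q1), (s2, q2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, q1.zip(q2).map { case (a, b) => a + b })
    }
  val means = sums.map(_ / n)
  val stds = squareSums.zip(means).map { case (sq, m) => math.sqrt(math.max(sq / n - m * m, 1e-12)) }
  data.map(v => v.indices.map(i => (v(i) - means(i)) / stds(i)).toArray)
}
{code}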



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2749:
---

Assignee: Sean Owen

 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
 junit:junit dep
 ---

 Key: SPARK-2749
 URL: https://issues.apache.org/jira/browse/SPARK-2749
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 The Maven-based builds in the build matrix have been failing for a few days:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/
 On inspection, it looks like the Spark SQL Java tests don't compile:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
 I confirmed it by repeating the command vs master:
 mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
 The problem is that this module doesn't depend on JUnit. In fact, none of the 
 modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
 in, in most places. However this module doesn't depend on 
 com.novocode:junit-interface
 Adding the junit:junit dependency fixes the compile problem. In fact, the 
 other modules with Java tests should probably depend on it explicitly instead 
 of happening to get it via com.novocode:junit-interface, since that is a bit 
 SBT/Scala-specific (and I am not even sure it's needed).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2749.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1690
[https://github.com/apache/spark/pull/1690]

 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
 junit:junit dep
 ---

 Key: SPARK-2749
 URL: https://issues.apache.org/jira/browse/SPARK-2749
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 The Maven-based builds in the build matrix have been failing for a few days:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/
 On inspection, it looks like the Spark SQL Java tests don't compile:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
 I confirmed it by repeating the command vs master:
 mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
 The problem is that this module doesn't depend on JUnit. In fact, none of the 
 modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
 in, in most places. However this module doesn't depend on 
 com.novocode:junit-interface
 Adding the junit:junit dependency fixes the compile problem. In fact, the 
 other modules with Java tests should probably depend on it explicitly instead 
 of happening to get it via com.novocode:junit-interface, since that is a bit 
 SBT/Scala-specific (and I am not even sure it's needed).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2772) Spark Project SQL fails compilation

2014-07-31 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu resolved SPARK-2772.
---

Resolution: Not a Problem

'mvn install' solves the issue.

 Spark Project SQL fails compilation
 ---

 Key: SPARK-2772
 URL: https://issues.apache.org/jira/browse/SPARK-2772
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu

 I used the following command:
 {code}
 mvn clean -Pyarn -Phive -Phadoop-2.4 -DskipTests package
 {code}
 I got:
 {code}
 [ERROR] 
 /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:52:
  wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should 
 be 4
 [ERROR] val shuffled = new ShuffledRDD[Row, Row, Row](rdd, part)
 [ERROR]^
 [ERROR] 
 /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:65:
  wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should 
 be 4
 [ERROR] val shuffled = new ShuffledRDD[Row, Null, Null](rdd, part)
 [ERROR]^
 [ERROR] 
 /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala:76:
  wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should 
 be 4
 [ERROR] val shuffled = new ShuffledRDD[Null, Row, Row](rdd, 
 partitioner)
 [ERROR]^
 [ERROR] 
 /homes/hortonzy/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala:151:
  wrong number of type arguments for org.apache.spark.rdd.ShuffledRDD, should 
 be 4
 [ERROR] val shuffled = new ShuffledRDD[Boolean, Row, Row](rdd, part)
 [ERROR]^
 [ERROR] four errors found
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2678) `Spark-submit` overrides user application options

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081302#comment-14081302
 ] 

Apache Spark commented on SPARK-2678:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/1699

 `Spark-submit` overrides user application options
 -

 Key: SPARK-2678
 URL: https://issues.apache.org/jira/browse/SPARK-2678
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 Here is an example:
 {code}
 ./bin/spark-submit --class Foo some.jar --help
 {code}
 Since {{--help}} appears behind the primary resource (i.e. {{some.jar}}), it 
 should be recognized as a user application option. But it's actually 
 overridden by {{spark-submit}} and will show the {{spark-submit}} help message.
 When directly invoking {{spark-submit}}, the constraints here are:
 # Options before primary resource should be recognized as {{spark-submit}} 
 options
 # Options after primary resource should be recognized as user application 
 options
 The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
 {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
 options like {{--master}} and user-defined application options together. For 
 example, say we'd like to write a new script {{start-thriftserver.sh}} to 
 start the Hive Thrift server, basically we may do this:
 {code}
 $SPARK_HOME/bin/spark-submit --class 
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
 {code}
 Then user may call this script like:
 {code}
 ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
 key=value
 {code}
 Notice that all options are captured by {{$@}}. If we put it before 
 {{spark-internal}}, they are all recognized as {{spark-submit}} options, thus 
 {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
 {{spark-internal}}, they *should* all be recognized as options of 
 {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
 recognized as {{spark-submit}} option and leads to the right behavior.
 Although currently all scripts using {{spark-submit}} work correctly, we 
 still should fix this bug, because it causes option name collision between 
 {{spark-submit}} and user application, and every time we add a new option to 
 {{spark-submit}}, some existing user applications may break. However, solving 
 this bug may cause some incompatible changes.
 The suggested solution here is using {{--}} as separator of {{spark-submit}} 
 options and user application options. For the Hive Thrift server example 
 above, user should call it in this way:
 {code}
 ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
 key=value
 {code}
 And {{SparkSubmitArguments}} should be responsible for splitting two sets of 
 options and pass them correctly.
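 A sketch of the proposed splitting rule (illustrative helper; the real change would live in SparkSubmitArguments):
 {code}
 // Everything before the first bare "--" is for spark-submit; everything after
 // it belongs to the user application.
 def splitSubmitArgs(args: Array[String]): (Seq[String], Seq[String]) = {
   val idx = args.indexOf("--")
   if (idx < 0) (args.toSeq, Seq.empty)
   else (args.take(idx).toSeq, args.drop(idx + 1).toSeq)
 }

 // splitSubmitArgs(Array("--master", "spark://some-host:7077", "--", "--hiveconf", "key=value"))
 // returns (Seq("--master", "spark://some-host:7077"), Seq("--hiveconf", "key=value"))
 {code}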



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x

2014-07-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2646.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1547
[https://github.com/apache/spark/pull/1547]

 log4j initialization not quite compatible with log4j 2.x
 

 Key: SPARK-2646
 URL: https://issues.apache.org/jira/browse/SPARK-2646
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 The logging code that handles log4j initialization leads to a stack overflow 
 error when used with log4j 2.x, which has just been released. This occurs 
 even when a downstream project has correctly adjusted SLF4J bindings, and that is 
 the right thing to do for log4j 2.x, since it is effectively a separate 
 project from 1.x.
 Here is the relevant bit of Logging.scala:
 {code}
 private def initializeLogging() {
   // If Log4j is being used, but is not initialized, load a default properties file
   val binder = StaticLoggerBinder.getSingleton
   val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
   val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
   if (!log4jInitialized && usingLog4j) {
     val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
     Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
       case Some(url) =>
         PropertyConfigurator.configure(url)
         log.info(s"Using Spark's default log4j profile: $defaultLogProps")
       case None =>
         System.err.println(s"Spark was unable to load $defaultLogProps")
     }
   }
   Logging.initialized = true
   // Force a call into slf4j to initialize it. Avoids this happening from multiple threads
   // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
   log
 }
 {code}
 The first minor issue is that there is a call to a logger inside this method, 
 which is initializing logging. In this situation, it ends up causing the 
 initialization to be called recursively until the stack overflow. It would be 
 slightly tidier to log this only after Logging.initialized = true. Or not at 
 all. But it's not the root problem, or else, it would not work at all now. 
 The calls to log4j classes here always reference log4j 1.2 no matter what. 
 For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, 
 usingLog4j means using log4j 1.2 and log4jInitialized means log4j 1.2 
 is initialized.
 usingLog4j should be false for log4j 2.x, because the initialization only 
 matters for log4j 1.2. But, it's true, and that's the real issue. And 
 log4jInitialized is always false, since calls to the log4j 1.2 API are stubs 
 and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence 
 the loop.
 This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The 
 SLF4J static binding class has the same name for both versions, 
 unfortunately, which causes the issue. However they're in different packages. 
 For example, if the test included ... and begins with org.slf4j, it should 
 work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the 
 moment, and is in package org.apache.logging.slf4j.
 Of course, I assume that SLF4J will eventually offer its own binding. I hope 
 to goodness they at least name the binding class differently, or else this 
 will again not work. But then some other check can probably be made.
 (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA 
 for him. I'll propose a PR too.)
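 A sketch of the stricter check described above (assumption: the log4j 1.2 binding class lives under org.slf4j.impl, while the log4j 2.x binding lives under org.apache.logging.slf4j):
 {code}
 import org.slf4j.impl.StaticLoggerBinder

 object Log4jBindingCheck {
   // Treat "log4j in use" as meaning the log4j 1.2 binding specifically: the 1.2
   // binding is under org.slf4j.impl, while the log4j 2.x binding is under
   // org.apache.logging.slf4j.
   def usingLog4j12: Boolean = {
     val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
     binderClass.startsWith("org.slf4j") && binderClass.endsWith("Log4jLoggerFactory")
   }
 }
 {code}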



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2511) Add TF-IDF featurizer

2014-07-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2511.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1671
[https://github.com/apache/spark/pull/1671]

 Add TF-IDF featurizer
 -

 Key: SPARK-2511
 URL: https://issues.apache.org/jira/browse/SPARK-2511
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.0


 Port the TF-IDF implementation that was used in the Databricks Cloud demo to 
 MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2468) zero-copy shuffle network communication

2014-07-31 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081357#comment-14081357
 ] 

Reynold Xin commented on SPARK-2468:


It's something I'd like to prototype for 1.2. Do you have any thoughts on this?

 zero-copy shuffle network communication
 ---

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user space buffer, and then back to a kernel send buffer before it reaches 
 the NIC. It does multiple copies of the data and context switching between 
 kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in the 
 kernel space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty NIO.
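 For reference, the zero-copy primitive mentioned above boils down to a call like this (a sketch; the target channel would be the socket to the reducer):
 {code}
 import java.io.{File, FileInputStream}
 import java.nio.channels.{FileChannel, WritableByteChannel}

 // Let the kernel push file bytes straight into the target channel without
 // copying them through a user-space buffer.
 def sendFileZeroCopy(file: File, target: WritableByteChannel): Long = {
   val in: FileChannel = new FileInputStream(file).getChannel
   try {
     val size = in.size()
     var position = 0L
     while (position < size) {
       position += in.transferTo(position, size - position, target)
     }
     position
   } finally {
     in.close()
   }
 }
 {code}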



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2756) Decision Tree bugs

2014-07-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-2756:
-

Description: 
3 bugs:

Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
features (in multiclass classification with categorical features, where the 
features had few enough values such that they could be considered unordered, 
i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).

Bug 2: calculateGainForSplit (for classification):
* It returns dummy prediction values when either the right or left children had 
0 weight.  These are incorrect for multiclass classification.

Bug 3: Off-by-1 when finding thresholds for splits for continuous features.
* When finding thresholds for possible splits for continuous features in 
DecisionTree.findSplitsBins, the thresholds were set according to individual 
training examples’ feature values.  This can cause problems for small datasets.


  was:
2 bugs:

Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
features (in multiclass classification with categorical features, where the 
features had few enough values such that they could be considered unordered, 
i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).

Bug 2: calculateGainForSplit (for classification):
* It returns dummy prediction values when either the right or left children had 
0 weight.  These are incorrect for multiclass classification.



 Decision Tree bugs
 --

 Key: SPARK-2756
 URL: https://issues.apache.org/jira/browse/SPARK-2756
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 3 bugs:
 Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
 features (in multiclass classification with categorical features, where the 
 features had few enough values such that they could be considered unordered, 
 i.e., isSpaceSufficientForAllCategoricalSplits=true).
 * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
 binIndex), where
 ** featureValue was from arr (so it was a feature value)
 ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
 * The rest of the code indexed agg as (node, feature, binIndex, label).
 Bug 2: calculateGainForSplit (for classification):
 * It returns dummy prediction values when either the right or left children 
 had 0 weight.  These are incorrect for multiclass classification.
 Bug 3: Off-by-1 when finding thresholds for splits for continuous features.
 * When finding thresholds for possible splits for continuous features in 
 DecisionTree.findSplitsBins, the thresholds were set according to individual 
 training examples’ feature values.  This can cause problems for small 
 datasets.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2777) ALS factors should be persisted in memory and disk

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081401#comment-14081401
 ] 

Apache Spark commented on SPARK-2777:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1700

 ALS factors should be persisted in memory and disk
 

 Key: SPARK-2777
 URL: https://issues.apache.org/jira/browse/SPARK-2777
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Now the factors are persisted in memory only. If they get evicted by later 
 jobs, we have to restart the computation from the very beginning. A better solution 
 would be to change the storage level to MEMORY_AND_DISK.
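 
 A small hedged sketch of the proposed change (the RDD here is a stand-in; the actual ALS factor RDDs live inside the ALS implementation):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.storage.StorageLevel
 
 // Persisting at MEMORY_AND_DISK makes evicted partitions spill to disk
 // instead of being recomputed from the very beginning.
 def persistFactors(sc: SparkContext): Unit = {
   val factors = sc.parallelize(0 until 1000).map(id => (id, Array.fill(10)(0.5)))
   factors.persist(StorageLevel.MEMORY_AND_DISK)
   factors.count()  // materialize once so later jobs reuse the cached copy
 }
 {code}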



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2272) Feature scaling which standardizes the range of independent variables or features of data.

2014-07-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2272:
-

Assignee: DB Tsai

 Feature scaling which standardizes the range of independent variables or 
 features of data.
 --

 Key: SPARK-2272
 URL: https://issues.apache.org/jira/browse/SPARK-2272
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai

 Feature scaling is a method used to standardize the range of independent 
 variables or features of data. In data processing, it is also known as data 
 normalization and is generally performed during the data preprocessing step.
 In this work, a trait called `VectorTransformer` is defined for generic 
 transformation of a vector. It contains two methods, `apply` which applies 
 transformation on a vector and `unapply` which applies inverse transformation 
 on a vector.
 There are three concrete implementations of `VectorTransformer`, and they all 
 can be easily extended with PMML transformation support. 
 1) `VectorStandardizer` - Standardises a vector given the mean and variance. 
 Since the standardization will densify the output, the output is always in 
 dense vector format.
  
 2) `VectorRescaler` -  Rescales a vector into a target range specified by a 
 tuple of two double values or by two vectors as the new target minimum and maximum. 
 Since the rescaling will subtract the minimum of each column first, the 
 output will always be a dense vector regardless of the input vector type.
 3) `VectorDivider` -  Transforms a vector by dividing by a constant or by dividing 
 element-wise by another vector. This transformation will preserve the 
 type of the input vector without densifying the result.
 Utility helper methods are implemented that take an RDD[Vector] as input; 
 the transformed RDD[Vector] and the transformer are returned for dividing, 
 rescaling, normalization, and standardization. 
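 
 To make the description concrete, here is a rough sketch of the trait and one implementation. The method bodies and constructor parameters are assumptions, not the code from the PR.
 {code}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 
 // Generic vector transformation as described: apply is the forward transform,
 // unapply the inverse.
 trait VectorTransformer extends Serializable {
   def apply(v: Vector): Vector
   def unapply(v: Vector): Vector
 }
 
 // Standardizes to zero mean / unit variance, densifying the output.
 class VectorStandardizer(mean: Array[Double], variance: Array[Double])
   extends VectorTransformer {
 
   override def apply(v: Vector): Vector = {
     val x = v.toArray
     Vectors.dense(Array.tabulate(x.length) { i =>
       val sd = math.sqrt(variance(i))
       if (sd == 0.0) 0.0 else (x(i) - mean(i)) / sd
     })
   }
 
   override def unapply(v: Vector): Vector = {
     val x = v.toArray
     Vectors.dense(Array.tabulate(x.length)(i => x(i) * math.sqrt(variance(i)) + mean(i)))
   }
 }
 {code}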



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2776) Add normalizeByCol method to mllib.util.MLUtils

2014-07-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-2776.


Resolution: Duplicate

 Add normalizeByCol method to mllib.util.MLUtils
 ---

 Key: SPARK-2776
 URL: https://issues.apache.org/jira/browse/SPARK-2776
 Project: Spark
  Issue Type: New Feature
Reporter: Andres Perez
Priority: Minor

 Add the ability to compute the mean and standard deviations of each vector 
 (LabeledPoint) component and normalize each vector in the RDD, using only RDD 
 transformations. The result is an RDD of Vectors where each column has a mean 
 of zero and standard deviation of one.
 See https://github.com/apache/spark/pull/1698
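 
 A rough sketch of what such a helper could look like using plain RDD operations (two passes for the column statistics, then a map to normalize; an illustration only, not the code from the PR):
 {code}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 
 // Compute per-column mean and standard deviation, then normalize every vector.
 def normalizeByCol(data: RDD[Vector]): RDD[Vector] = {
   val n = data.count().toDouble
   val sums = data.map(_.toArray).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
   val sqSums = data.map(_.toArray.map(x => x * x))
     .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
   val means = sums.map(_ / n)
   val stds = sqSums.zip(means).map { case (sq, m) => math.sqrt(sq / n - m * m) }
   data.map { v =>
     val x = v.toArray
     Vectors.dense(Array.tabulate(x.length) { i =>
       if (stds(i) == 0.0) 0.0 else (x(i) - means(i)) / stds(i)
     })
   }
 }
 {code}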



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-07-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081465#comment-14081465
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

I'm working on this but this all sort of depends on progress being made on the 
Yarn side, so at this moment I'm not yet ready to send any PRs.

 Add integration with Yarn's Application Timeline Server
 ---

 Key: SPARK-1537
 URL: https://issues.apache.org/jira/browse/SPARK-1537
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 It would be nice to have Spark integrate with Yarn's Application Timeline 
 Server (see YARN-321, YARN-1530). This would allow users running Spark on 
 Yarn to have a single place to go for all their history needs, and avoid 
 having to manage a separate service (Spark's built-in server).
 At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
 although there is still some ongoing work. But the basics are there, and I 
 wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-07-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081490#comment-14081490
 ] 

Marcelo Vanzin commented on SPARK-1576:
---

With Sandy's recent patch (https://github.com/apache/spark/pull/1253) this 
should be easy to do (spark-submit --conf spark.executor.extraJavaOptions=blah).
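 
For reference, a small sketch of the equivalent programmatic route using the property named above (the app name and JVM flag are placeholders):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sets executor JVM options through configuration instead of SPARK_JAVA_OPTS.
val conf = new SparkConf()
  .setAppName("java-opts-example")
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")
val sc = new SparkContext(conf)
{code}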

 Passing of JAVA_OPTS to YARN on command line
 

 Key: SPARK-1576
 URL: https://issues.apache.org/jira/browse/SPARK-1576
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.9.0, 0.9.1, 0.9.2
Reporter: Nishkam Ravi
 Attachments: SPARK-1576.patch


 JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
 or as config vars (after Patrick's recent change). It would be good to allow 
 the user to pass them on the command line as well, to restrict the scope to a single 
 application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1131) Better document the --args option for yarn-standalone mode

2014-07-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081497#comment-14081497
 ] 

Marcelo Vanzin commented on SPARK-1131:
---

This is probably obsolete now with spark-submit.

 Better document the --args option for yarn-standalone mode
 --

 Key: SPARK-1131
 URL: https://issues.apache.org/jira/browse/SPARK-1131
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Pérez González
Assignee: Karthik Kambatla

 It took me a while to figure out that the correct way to use it with multiple 
 arguments was to include the option multiple times.
 I.e.
 --args arg1
 --args arg2
 instead of
 --args arg1 arg2 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2778) Add unit tests for Yarn integration

2014-07-31 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-2778:
-

 Summary: Add unit tests for Yarn integration
 Key: SPARK-2778
 URL: https://issues.apache.org/jira/browse/SPARK-2778
 Project: Spark
  Issue Type: Test
  Components: YARN
Reporter: Marcelo Vanzin


It would be nice to add some Yarn integration tests to the unit tests in Spark; 
Yarn provides a MiniYARNCluster class that can be used to spawn a cluster 
locally.
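
A rough sketch of such a test fixture, assuming Hadoop's MiniYARNCluster test class; the actual Spark submission against it is elided:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

// Spin up a one-NodeManager YARN cluster in-process for integration tests.
val yarnConf = new YarnConfiguration()
val cluster = new MiniYARNCluster("spark-yarn-test", 1, 1, 1)
cluster.init(yarnConf)
cluster.start()
try {
  val testConf = cluster.getConfig  // holds the RM/NM addresses to point Spark at
  // ... submit a Spark application in yarn mode using testConf ...
} finally {
  cluster.stop()
}
{code}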



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081514#comment-14081514
 ] 

Ted Malaska commented on SPARK-2447:


Tell me if I'm wrong, but the core offering of SPARK-1127 is also provided by SPARK-2447.

All I would have to do is provide the RDD functions to call the bulkPut or a 
future bulkPartitionPut.
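
To make that concrete, here is a hedged sketch of what such an RDD-level helper might look like; the name, table layout, and HBase client calls are assumptions, not the API from the pull request:
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical bulkPut: one HBase connection per partition, auto-flush off for
// throughput, explicit flush at the end.
def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    val hbaseConf = HBaseConfiguration.create()
    val table = new HTable(hbaseConf, tableName)
    table.setAutoFlush(false)
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()
    table.close()
  }
}
{code}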



 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-31 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081524#comment-14081524
 ] 

Tathagata Das commented on SPARK-2447:
--

Exactly!! That's why I feel that both have their merits. 2447 provides 
lower-level, all-inclusive interfaces with which slightly advanced users can 
do arbitrary things, but it requires programming against HBase types like 
Put. However, 1127 provides a simple interface which allows 
not-so-advanced users to do a set of simple things without requiring too much 
HBase knowledge. They are complementary, and the latter should be implemented 
on top of the former. 



 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2063:


Target Version/s: 1.2.0  (was: 1.1.0)

 Creating a SchemaRDD via sql() does not correctly resolve nested types
 --

 Key: SPARK-2063
 URL: https://issues.apache.org/jira/browse/SPARK-2063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Cheng Lian

 For example, from the typical twitter dataset:
 {code}
 scala> val popularTweets = sql("SELECT retweeted_status.text, 
 MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status 
 is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
 scala> popularTweets.toString
 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch MultiInstanceRelations
 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch CaseInsensitiveAttributeReferences
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
 qualifiers on unresolved object, tree: 'retweeted_status.text
   at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
   at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217)
   at 
 

[jira] [Updated] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2096:


Target Version/s: 1.2.0  (was: 1.1.0)

 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter

 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and then return those values as an array. But the SQL parser 
 cannot resolve it.
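 
 For illustration, the two access forms written as queries; the table and field names are hypothetical, and an existing SparkContext `sc` with a registered table is assumed:
 {code}
 // Assumes an existing SparkContext `sc` and a registered table "nested"
 // with a column arrayOfStruct whose elements have a field1 (hypothetical names).
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 // field1 of the first element -- the parser should not treat field1 as an alias here.
 sqlContext.sql("SELECT arrayOfStruct[0].field1 FROM nested")
 // All field1 values of the array, returned as an array -- currently unresolvable.
 sqlContext.sql("SELECT arrayOfStruct.field1 FROM nested")
 {code}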



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2740) In JavaPairRdd, allow user to specify ascending and numPartitions for sortByKey

2014-07-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2740.
---

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Rui Li

 In JavaPairRdd, allow user to specify ascending and numPartitions for 
 sortByKey
 ---

 Key: SPARK-2740
 URL: https://issues.apache.org/jira/browse/SPARK-2740
 Project: Spark
  Issue Type: Improvement
Reporter: Rui Li
Assignee: Rui Li
Priority: Minor
 Fix For: 1.1.0


 It should be more convenient if user can specify ascending and numPartitions 
 when calling sortByKey.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-31 Thread Carlos Fuertes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081598#comment-14081598
 ] 

Carlos Fuertes commented on SPARK-2017:
---

I have done some tests with the solution where you use JSON to send the data. 
If you run with 50k tasks

sc.parallelize(1 to 100, 5).count()

the JSON [/stages/stage/tasks/json/?id=0] that represents the tasks table takes 
~15 MB if you download it. You can get the JSON in a few seconds, but the UI 
[/stages/stage/?id=0] will still take forever to render it (the summary still shows 
up nonetheless).

I did not change the way we are rendering, that is, move to pagination or 
anything else, and I am still using sorttable to allow sorting of the table.

Maybe just converting to JSON is too simple and you still have to stream 
the data if you want to go to around 50k tasks and higher while maintaining 
responsiveness of the browser. And/or incorporate pagination directly, with a 
global index for the tasks on the backend. 











 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-31 Thread Carlos Fuertes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081605#comment-14081605
 ] 

Carlos Fuertes commented on SPARK-2017:
---

I did not realize that the tasks all have their own index already so 
implementing the pagination on top of it should be simple. I'll give it a try.

 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-31 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081616#comment-14081616
 ] 

Josh Rosen edited comment on SPARK-2282 at 7/31/14 10:37 PM:
-

Merged the improved fix from https://github.com/apache/spark/pull/1503 into 1.1.


was (Author: joshrosen):
Merged the improved fix from https://github.com/apache/spark/pull/1503

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.
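 
 A minimal sketch of the in-Spark alternative mentioned above: create the socket unconnected, set SO_REUSEADDR, then connect (host and port are placeholders):
 {code}
 import java.net.{InetSocketAddress, Socket}
 
 // SO_REUSEADDR lets local ports still in TIME_WAIT from earlier connections be reused.
 val socket = new Socket()
 socket.setReuseAddress(true)
 socket.connect(new InetSocketAddress("localhost", 65000))
 {code}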



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2282:
--

Fix Version/s: 1.1.0

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-31 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081616#comment-14081616
 ] 

Josh Rosen commented on SPARK-2282:
---

Merged the improved fix from https://github.com/apache/spark/pull/1503

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1, 1.1.0


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2771) GenerateMIMAIgnore fails scalastyle check due to long line

2014-07-31 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu resolved SPARK-2771.
---

Resolution: Fixed

 GenerateMIMAIgnore fails scalastyle check due to long line
 --

 Key: SPARK-2771
 URL: https://issues.apache.org/jira/browse/SPARK-2771
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-2771-v1.txt


 I got the following error building master branch:
 {code}
 [INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-tools_2.10 
 ---
 error 
 file=/homes/hortonzy/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala
  message=File line length exceeds 100 characters line=118
 Saving to outputFile=/homes/hortonzy/spark/tools/scalastyle-output.xml
 Processed 3 file(s)
 {code}
 This is caused by the 3rd line below:
 {code}
 classSymbol.typeSignature.members.filterNot(x =>
   x.fullName.startsWith("java") || x.fullName.startsWith("scala"))
 .filter(x => isPackagePrivate(x) || isDeveloperApi(x) || 
 isExperimental(x)).map(_.fullName) ++
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081642#comment-14081642
 ] 

Apache Spark commented on SPARK-1812:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1701

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
 really well supported by Maven since published poms aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081650#comment-14081650
 ] 

Apache Spark commented on SPARK-1812:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1702

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
 really well supported by Maven since published poms aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081657#comment-14081657
 ] 

Apache Spark commented on SPARK-1812:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1703

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
 really well supported by Maven since published poms aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-695) Exponential recursion in getPreferredLocations

2014-07-31 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081669#comment-14081669
 ] 

Aaron Staple commented on SPARK-695:


Progress has been made on a PR here:
https://github.com/apache/spark/pull/1362

 Exponential recursion in getPreferredLocations
 --

 Key: SPARK-695
 URL: https://issues.apache.org/jira/browse/SPARK-695
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia

 This was reported to happen in DAGScheduler for graphs with many paths from 
 the root up, though I haven't yet found a good test case for it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081671#comment-14081671
 ] 

Apache Spark commented on SPARK-1812:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1704

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
 really well supported by Maven since published poms aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2779) asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map

2014-07-31 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2779:
---

 Summary: asInstanceOf[Map[...]] should use scala.collection.Map 
instead of scala.collection.immutable.Map
 Key: SPARK-2779
 URL: https://issues.apache.org/jira/browse/SPARK-2779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Minor
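
A small sketch of the distinction the summary points at (illustrative only, not the code in question):
{code}
import scala.collection.mutable

// A value that happens to hold a mutable Map at runtime.
val m: Any = mutable.Map("a" -> 1)

// Works: scala.collection.Map is the common supertype of mutable and immutable maps.
val general = m.asInstanceOf[scala.collection.Map[String, Int]]

// Would throw ClassCastException at runtime: a mutable.Map is not an immutable.Map.
// val broken = m.asInstanceOf[scala.collection.immutable.Map[String, Int]]
{code}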






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2779) asInstanceOf[Map[...]] should use scala.collection.Map instead of scala.collection.immutable.Map

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081732#comment-14081732
 ] 

Apache Spark commented on SPARK-2779:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/1705

 asInstanceOf[Map[...]] should use scala.collection.Map instead of 
 scala.collection.immutable.Map
 

 Key: SPARK-2779
 URL: https://issues.apache.org/jira/browse/SPARK-2779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2780) Create a StreamingContext.setLocalProperty for setting local property of jobs launched by streaming

2014-07-31 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-2780:


 Summary: Create a StreamingContext.setLocalProperty for setting 
local property of jobs launched by streaming
 Key: SPARK-2780
 URL: https://issues.apache.org/jira/browse/SPARK-2780
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.1.0
Reporter: Tathagata Das
Priority: Minor


SparkContext.setLocalProperty makes all Spark jobs submitted using
the current thread belong to the set pool. However, in Spark
Streaming, all the jobs are actually launched in the background from a
different thread. So this setting does not work. 

Currently, there is a
workaround. If you are doing any kind of output operation on
DStreams, like DStream.foreachRDD(), you can set the property inside
it:

dstream.foreachRDD(rdd =>
   rdd.sparkContext.setLocalProperty(...)
)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Aaron Staple (JIRA)
Aaron Staple created SPARK-2781:
---

 Summary: Analyzer should check resolution of LogicalPlans
 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple


Currently the Analyzer’s CheckResolution rule checks that all attributes are 
resolved by searching for unresolved Expressions.  But some LogicalPlans, 
including Union, contain custom implementations of the resolve attribute that 
validate other criteria in addition to checking for attribute resolution of 
their descendants.  These LogicalPlans are not currently validated by the 
CheckResolution implementation.

As a result, it is currently possible to execute a query generated from 
unresolved LogicalPlans.  One example is a UNION query that produces rows with 
different data types in the same column:

{noformat}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class T1(value:Seq[Int])
val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
t1.registerAsTable("t1")
sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
{noformat}

In this example, the type coercion implementation cannot unify array and 
integer types.  One row contains an array in the returned column and the other 
row contains an integer.  The result is:

{noformat}
res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
{noformat}

I believe fixing this is a first step toward improving validation for Union 
(and similar) plans.  (For instance, Union does not currently validate that its 
children contain the same number of columns.)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2781.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Michael Armbrust

We actually fixed this a long time ago: 
https://github.com/apache/spark/commit/b3e768e154bd7175db44c3ffc3d8f783f15ab776

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple
Assignee: Michael Armbrust
 Fix For: 1.0.1, 1.1.0


 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081828#comment-14081828
 ] 

Apache Spark commented on SPARK-2781:
-

User 'staple' has created a pull request for this issue:
https://github.com/apache/spark/pull/1706

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple
Assignee: Michael Armbrust
 Fix For: 1.0.1, 1.1.0


 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-2781:
-


I'm sorry... I thought this was stale and did not read it carefully. Reopening.

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple
Assignee: Michael Armbrust
 Fix For: 1.0.1, 1.1.0


 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081833#comment-14081833
 ] 

Aaron Staple commented on SPARK-2781:
-

No problem, I think the current validation checks Expressions but there are 
some cases where LogicalPlans might not be resolved even though Expressions are 
resolved.

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple
Assignee: Michael Armbrust
 Fix For: 1.0.1, 1.1.0


 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2781:


Target Version/s: 1.2.0

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1, 1.1.0
Reporter: Aaron Staple
Assignee: Michael Armbrust

 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2781:


Fix Version/s: (was: 1.0.1)
   (was: 1.1.0)

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1, 1.1.0
Reporter: Aaron Staple
Assignee: Michael Armbrust

 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2781:


Affects Version/s: 1.1.0
   1.0.1

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1, 1.1.0
Reporter: Aaron Staple
Assignee: Michael Armbrust

 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolve attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

