[jira] [Created] (SPARK-3675) Allow starting JDBC server on an existing context

2014-09-24 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3675:
---

 Summary: Allow starting JDBC server on an existing context
 Key: SPARK-3675
 URL: https://issues.apache.org/jira/browse/SPARK-3675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


A common question on the mailing list is how to read from temporary tables over 
JDBC.  While we should try and support most of this in SQL, it would also be 
nice to query generic RDDs over JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3675) Allow starting JDBC server on an existing context

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145969#comment-14145969
 ] 

Apache Spark commented on SPARK-3675:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2515

 Allow starting JDBC server on an existing context
 -

 Key: SPARK-3675
 URL: https://issues.apache.org/jira/browse/SPARK-3675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 A common question on the mailing list is how to read from temporary tables 
 over JDBC.  While we should try and support most of this in SQL, it would 
 also be nice to query generic RDDs over JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3675) Allow starting JDBC server on an existing context

2014-09-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3675:

Target Version/s: 1.2.0

 Allow starting JDBC server on an existing context
 -

 Key: SPARK-3675
 URL: https://issues.apache.org/jira/browse/SPARK-3675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 A common question on the mailing list is how to read from temporary tables 
 over JDBC.  While we should try and support most of this in SQL, it would 
 also be nice to query generic RDDs over JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3676) jdk version lead to spark hql test suite error

2014-09-24 Thread wangfei (JIRA)
wangfei created SPARK-3676:
--

 Summary: jdk version lead to spark hql test suite error
 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


System.out.println(1/500d) gives different results under different JDK versions:
jdk 1.6.0_31  0.0020
jdk 1.7.0_05  0.002

This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK
version to 1.6.0_31):
[info] - division *** FAILED ***
[info]   Results do not match for division:
[info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 
/ COUNT(1)) AS c_3#695]
[info] UnresolvedRelation None, src, None
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info] Project []
[info]  MetastoreRelation default, src, None
[info]   
[info]   == Physical Plan ==
[info]   Limit 1
[info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
DoubleType)) AS c_3#695]
[info] Exchange SinglePartition
[info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
[info]   HiveTableScan [], (MetastoreRelation default, src, None), None
[info]   
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0c_1 c_2 c_3
[info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
[info]   !2.0   0.5 0.  0.002   2.0 0.5 
0.  0.0020 (HiveComparisonTest.scala:370)


[info] - timestamp cast #1 *** FAILED ***
[info]   Results do not match for timestamp cast #1:
[info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info] UnresolvedRelation None, src, None
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]Project [0.0010 AS c_0#995]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Physical Plan ==
[info]   Limit 1
[info]Project [0.0010 AS c_0#995]
[info] HiveTableScan [], (MetastoreRelation default, src, None), None
[info]   
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0
[info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
[info]   !0.001   0.0010 (HiveComparisonTest.scala:370)






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3676:
---
Summary: jdk version lead to spark sql test suite error  (was: jdk version 
lead to spark hql test suite error)

 jdk version lead to spark sql test suite error
 --

 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 System.out.println(1/500d) gives different results under different JDK versions:
 jdk 1.6.0_31  0.0020
 jdk 1.7.0_05  0.002
 This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK
 version to 1.6.0_31):
 [info] - division *** FAILED ***
 [info]   Results do not match for division:
 [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
 c_2#694,(1 / COUNT(1)) AS c_3#695]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
 c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
 DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
 leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
 c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] Project []
 [info]  MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
 c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
 DoubleType)) AS c_3#695]
 [info] Exchange SinglePartition
 [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
 [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0c_1 c_2 c_3
 [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
 [info]   !2.0   0.5 0.  0.002   2.0 0.5 
 0.  0.0020 (HiveComparisonTest.scala:370)
 [info] - timestamp cast #1 *** FAILED ***
 [info]   Results do not match for timestamp cast #1:
 [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0
 [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
 [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146011#comment-14146011
 ] 

Apache Spark commented on SPARK-3620:
-

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/2516

 Refactor config option handling code for spark-submit
 -

 Key: SPARK-3620
 URL: https://issues.apache.org/jira/browse/SPARK-3620
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.1.0
Reporter: Dale Richardson
Assignee: Dale Richardson
Priority: Minor

 I'm proposing it's time to refactor the configuration argument handling code 
 in spark-submit. The code has grown organically in a short period of time, 
 handles a pretty complicated logic flow, and is now quite fragile. Some 
 issues that have been identified:
 1. Hand-crafted property file readers that do not support the property file 
 format as specified in 
 http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
 2. ResolveURI is not called on paths read from conf/properties files
 3. Inconsistent means of merging / overriding values from different sources 
 (some get overridden by file, others by manually setting a field on an object, 
 some by properties)
 4. Argument validation should be done after combining config files, system 
 properties, and command line arguments
 5. An alternate conf file location is not handled in the shell scripts
 6. Some options can only be passed as command line arguments
 7. Defaults for options are hard-coded (and sometimes overridden multiple 
 times) in many places throughout the code, e.g. master = local[*]
 The initial proposal is to use Typesafe Config to read in the config information 
 and merge the various config sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146017#comment-14146017
 ] 

Sean Owen commented on SPARK-3662:
--

Maybe I'm missing something, but does this just mean you can't import pandas 
in its entirety? If you're modifying the example, you should import only what you 
need from pandas. Or it may indeed be that you need to modify the random import 
to accommodate other modifications you want to make.

But what is the problem with the included example? It runs fine without 
modifications, no?

 Importing pandas breaks included pi.py example
 --

 Key: SPARK-3662
 URL: https://issues.apache.org/jira/browse/SPARK-3662
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.1.0
 Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
Reporter: Evan Samanas

 If I add "import pandas" at the top of the included pi.py example and submit 
 it using spark-submit --master yarn-client, I get this stack trace:
 {code}
 Traceback (most recent call last):
   File /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py, line 
 39, in module
 count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
   File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 759, in reduce
 vals = self.mapPartitions(func).collect()
   File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 723, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File 
 /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
  line 538, in __call__
   File 
 /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, 
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 
 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
 org.apache.spark.api.python.PythonException (Traceback (most recent call 
 last):
   File 
 /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
  line 75, in main
 command = pickleSer._read_with_length(infile)
   File 
 /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
  line 150, in _read_with_length
 return self.loads(obj)
 ImportError: No module named algos
 {code}
 The example works fine if I move the statement "from random import random" 
 from the top into the function (def f(_)) defined in the example.  As near 
 as I can tell, random is getting confused with a function of the same name 
 within pandas.algos.
 Submitting the same script using --master local works, but it prints a 
 distressing amount of random characters to stdout or stderr and messes up my 
 terminal:
 {code}
 ...
 @J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
 @J!@J@J#@J$@J%@J@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J@J=@J@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ
AJ
 AJ
   AJ
 AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ 
 AJ!AJAJ#AJ$AJ%AJAJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ=AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23
  15:42:09 INFO SparkContext: Job finished: reduce at 
 /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
 11.276879779 s
 J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ
  BJ
 BJ
   BJ
 BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ 
 BJ!BJBJ#BJ$BJ%BJBJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ=BJBJ?BJ@Be.
 �]qJ#1a.
 �]qJX4a.
 �]qJX4a.
 �]qJ#1a.
 �]qJX4a.
 �]qJX4a.
 �]qJ#1a.
 �]qJX4a.
 �]qJX4a.
 �]qJa.
 Pi is roughly 3.146136
 {code}
 No idea if that's related, but thought I'd include it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146035#comment-14146035
 ] 

Apache Spark commented on SPARK-3676:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2517

 jdk version lead to spark sql test suite error
 --

 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 System.out.println(1/500d) gives different results under different JDK versions:
 jdk 1.6.0_31  0.0020
 jdk 1.7.0_05  0.002
 This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK
 version to 1.6.0_31):
 [info] - division *** FAILED ***
 [info]   Results do not match for division:
 [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
 c_2#694,(1 / COUNT(1)) AS c_3#695]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
 c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
 DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
 leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
 c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] Project []
 [info]  MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
 c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
 DoubleType)) AS c_3#695]
 [info] Exchange SinglePartition
 [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
 [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0c_1 c_2 c_3
 [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
 [info]   !2.0   0.5 0.  0.002   2.0 0.5 
 0.  0.0020 (HiveComparisonTest.scala:370)
 [info] - timestamp cast #1 *** FAILED ***
 [info]   Results do not match for timestamp cast #1:
 [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0
 [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
 [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-09-24 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146039#comment-14146039
 ] 

Aaron Davidson commented on SPARK-3267:
---

I don't have it anymore, unfortunately. Michael and I did a little digging at 
the time, and I think we found the reason for the deadlock, shown in the stack 
traces above, but decided it was a very unlikely scenario. Indeed, the query 
did not consistently deadlock; this only occurred a single time.

 Deadlock between ScalaReflectionLock and Data type initialization
 -

 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Priority: Critical

 Deadlock here:
 {code}
 Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 
 nid=0x27a in Object.wait() [0x7fab60c2e000
 ]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:202)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
 a:175)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:304)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:314)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:313)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 ...
 {code}
 and
 {code}
 Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 
 nid=0x27e in Object.wait() [0x7fab0eeec000
 ]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
 - locked 0x00064e5d9a48 (a 
 org.apache.spark.sql.catalyst.expressions.Cast)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
 scala:139)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
 scala:139)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
 at 
 

[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146040#comment-14146040
 ] 

Sean Owen commented on SPARK-3676:
--

(For the interested, I looked it up, since the behavior change sounds 
surprising. This is in fact a bug in Java 6 that was fixed in Java 7: 
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022 It may even be 
fixed in later versions of Java 6, but I have a very recent one and it is not.)

 jdk version lead to spark sql test suite error
 --

 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 System.out.println(1/500d) gives different results under different JDK versions:
 jdk 1.6.0_31  0.0020
 jdk 1.7.0_05  0.002
 This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK
 version to 1.6.0_31):
 [info] - division *** FAILED ***
 [info]   Results do not match for division:
 [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
 c_2#694,(1 / COUNT(1)) AS c_3#695]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
 c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
 DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
 leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
 c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] Project []
 [info]  MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
 c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
 DoubleType)) AS c_3#695]
 [info] Exchange SinglePartition
 [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
 [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0c_1 c_2 c_3
 [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
 [info]   !2.0   0.5 0.  0.002   2.0 0.5 
 0.  0.0020 (HiveComparisonTest.scala:370)
 [info] - timestamp cast #1 *** FAILED ***
 [info]   Results do not match for timestamp cast #1:
 [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0
 [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
 [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146041#comment-14146041
 ] 

Apache Spark commented on SPARK-3663:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2518

 Document SPARK_LOG_DIR and SPARK_PID_DIR
 

 Key: SPARK-3663
 URL: https://issues.apache.org/jira/browse/SPARK-3663
 Project: Spark
  Issue Type: Documentation
Reporter: Andrew Ash
Assignee: Andrew Ash

 I'm using these two parameters in some Puppet scripts for standalone 
 deployment and realized that they're not documented anywhere.  We should 
 document them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146050#comment-14146050
 ] 

wangfei commented on SPARK-3676:


Hmm, I see, thanks for that.

 jdk version lead to spark sql test suite error
 --

 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 System.out.println(1/500d) gives different results under different JDK versions:
 jdk 1.6.0_31  0.0020
 jdk 1.7.0_05  0.002
 This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK
 version to 1.6.0_31):
 [info] - division *** FAILED ***
 [info]   Results do not match for division:
 [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
 c_2#694,(1 / COUNT(1)) AS c_3#695]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
 c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
 DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
 leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
 c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
 [info] Project []
 [info]  MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
 c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
 DoubleType)) AS c_3#695]
 [info] Exchange SinglePartition
 [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
 [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0c_1 c_2 c_3
 [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
 [info]   !2.0   0.5 0.  0.002   2.0 0.5 
 0.  0.0020 (HiveComparisonTest.scala:370)
 [info] - timestamp cast #1 *** FAILED ***
 [info]   Results do not match for timestamp cast #1:
 [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
 [info]   == Parsed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] UnresolvedRelation None, src, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Limit 1
 [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] MetastoreRelation default, src, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Limit 1
 [info]Project [0.0010 AS c_0#995]
 [info] HiveTableScan [], (MetastoreRelation default, src, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   c_0
 [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
 [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146094#comment-14146094
 ] 

Apache Spark commented on SPARK-3526:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2519

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash
Assignee: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple of people appreciated the description of locality below, so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}
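 As a concrete illustration of the wait-then-degrade behavior described in the quote above, here is a minimal PySpark sketch; the spark.locality.wait settings are existing knobs, and the values shown are the 3000 ms defaults (this is illustrative configuration, not the proposed docs text):
 {code}
 # Illustrative sketch only: spark.locality.wait is how long Spark waits for a slot
 # at the current locality level before falling back to a less local one; per-level
 # overrides exist as well. Values are in milliseconds (3000 ms is the default).
 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setAppName("LocalityTuning")
         .set("spark.locality.wait", "3000")        # overall wait before degrading a level
         .set("spark.locality.wait.node", "3000")   # NODE_LOCAL -> RACK_LOCAL
         .set("spark.locality.wait.rack", "3000"))  # RACK_LOCAL -> ANY
 sc = SparkContext(conf=conf)
 {code}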



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common

2014-09-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3677:
-

 Summary: Scalastyle is never applied to the sources under 
yarn/common
 Key: SPARK-3677
 URL: https://issues.apache.org/jira/browse/SPARK-3677
 Project: Spark
  Issue Type: Bug
  Components: Build, YARN
Affects Versions: 1.2.0
Reporter: Kousuke Saruta


When we run "sbt -Pyarn scalastyle", scalastyle is not applied to the sources 
under yarn/common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146171#comment-14146171
 ] 

Apache Spark commented on SPARK-3677:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2520

 Scalastyle is never applied to the sources under yarn/common
 

 Key: SPARK-3677
 URL: https://issues.apache.org/jira/browse/SPARK-3677
 Project: Spark
  Issue Type: Bug
  Components: Build, YARN
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 When we run "sbt -Pyarn scalastyle", scalastyle is not applied to the sources 
 under yarn/common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-24 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146245#comment-14146245
 ] 

Matthew Farrellee commented on SPARK-3639:
--

seems reasonable to me

 Kinesis examples set master as local
 

 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor
  Labels: examples

 The Kinesis examples set the master as local, thus not allowing the examples to be 
 tested on a cluster.
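 A minimal sketch of the suggested fix (hypothetical example name): leave the master unset in the example code itself, so the value passed via spark-submit --master is used.
 {code}
 # Sketch of the fix, with a hypothetical example name: do not call setMaster() in
 # the example code, so --master passed to spark-submit (e.g. a YARN/Mesos cluster)
 # takes effect instead of a hard-coded local master.
 from pyspark import SparkConf, SparkContext

 conf = SparkConf().setAppName("KinesisWordCount")   # no setMaster("local[*]") here
 sc = SparkContext(conf=conf)
 {code}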



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Ryan D Braley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146246#comment-14146246
 ] 

Ryan D Braley commented on SPARK-2691:
--

+1. Spark typically lags behind Mesos in version numbers, so if you run Mesos 
today you have to choose between Spark and Docker. With this we could have our 
cake and eat it too :)

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
  Labels: mesos

 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is then downloaded on each slave (or 
 even on each execution, depending on whether coarse-grained mode is used).
 We want to make Spark able to launch executors via a Docker image that 
 utilizes the recent Docker and Mesos integration work.
 With that integration, Spark could simply specify a Docker image and whatever 
 options are needed, and it should continue to work as-is.
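 From the user's side, the integration could look something like the sketch below; the spark.mesos.executor.docker.image key is an assumption about how such a setting might be named, not an option that exists at the time of this issue.
 {code}
 # Hedged sketch (assumed configuration key, hypothetical Mesos master URL): point
 # the Mesos executors at a Docker image instead of a tarball/executor URI.
 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setMaster("mesos://zk://zk1:2181/mesos")
         .setAppName("DockerizedExecutors")
         .set("spark.mesos.executor.docker.image", "example/spark:latest"))
 sc = SparkContext(conf=conf)
 {code}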



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode

2014-09-24 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3678:


 Summary: Yarn app name reported in RM is different between cluster 
and client mode
 Key: SPARK-3678
 URL: https://issues.apache.org/jira/browse/SPARK-3678
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves


If you launch an application in yarn cluster mode the name of the application 
in the ResourceManager generally shows up as the full name 
org.apache.spark.examples.SparkHdfsLR.  If you start the same app in client 
mode it shows up as SparkHdfsLR.

We should be consistent between them.  

I haven't looked at it in detail; perhaps it's only the examples, but I think 
I've seen this with customer apps also.
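Until the inconsistency is fixed, a workaround sketch is to set the application name explicitly so it does not depend on the deploy mode:

{code}
# Workaround sketch (not a fix for the underlying issue): set an explicit app name
# so the name shown in the YARN ResourceManager is the same in cluster and client mode.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("SparkHdfsLR")   # short, deploy-mode-independent name
sc = SparkContext(conf=conf)
{code}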



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action

2014-09-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146640#comment-14146640
 ] 

Andrew Ash commented on SPARK-3466:
---

How would you design this feature?

I can imagine measuring the size of partitions / RDD elements while they are 
held in memory across the cluster, sending those sizes back to the driver, and 
having the driver throw an exception if the requested size exceeds the 
threshold.  Otherwise proceed as normal.

Is that how you were envisioning implementation?

 Limit size of results that a driver collects for each action
 

 Key: SPARK-3466
 URL: https://issues.apache.org/jira/browse/SPARK-3466
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia

 Right now, operations like collect() and take() can crash the driver if they 
 bring back too much data. We should add a spark.driver.maxResultSize setting 
 (or something like that) that will make the driver abort a job if its result 
 is too big. We can set it to some fraction of the driver's memory by default, 
 or to something like 100 MB.
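 Assuming the setting ends up with the proposed name, usage would look roughly like this sketch:
 {code}
 # Sketch assuming the proposed setting name spark.driver.maxResultSize: cap how much
 # data a single action may bring back so the driver aborts the job instead of OOMing.
 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setAppName("BoundedCollect")
         .set("spark.driver.maxResultSize", "100m"))   # abort if results exceed ~100 MB
 sc = SparkContext(conf=conf)

 small = sc.parallelize(range(1000)).collect()         # well under the cap, returns normally
 {code}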



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Evan Samanas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146639#comment-14146639
 ] 

Evan Samanas commented on SPARK-3662:
-

I wouldn't focus on the example, that I modified it, or whether I should be 
importing a small portion of pandas.  The issue here is that Spark breaks in 
this case because of a name collision.  Modifying the example is simply the one 
reproducer I've found.

I was modifying the example to learn about how Spark ships Python code to the 
cluster.  In this case, I expected pandas to only be imported in the driver 
program and not to be imported by any workers.  The workers do not have pandas 
installed, so expected behavior means the example would run to completion, and 
an ImportError would mean that the workers are importing things they don't need 
for the task at hand.

The way I expected Spark to work IS actually how Spark works...modules will 
only be imported by workers if a function passed to them uses the modules, but 
this error showed me false evidence to the contrary.  I'm assuming the error is 
in Spark's modifications to CloudPickle...not in the way the example is set up.
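To make the expected behavior concrete, here is a minimal sketch in the spirit of pi.py (hypothetical app name): with the only worker-side import placed inside the function, a driver-only pandas import at module scope is never needed on the executors.

{code}
# Minimal sketch in the spirit of pi.py (hypothetical app name): the function shipped
# to the workers imports only what it needs, so a module-level "import pandas" stays
# a driver-only dependency.
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PiWithLocalImport")

def f(_):
    from random import random          # imported per task on the worker
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

n = 100000 * 2
count = sc.parallelize(range(1, n + 1), 2).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
{code}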

 Importing pandas breaks included pi.py example
 --

 Key: SPARK-3662
 URL: https://issues.apache.org/jira/browse/SPARK-3662
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.1.0
 Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
Reporter: Evan Samanas

 If I add "import pandas" at the top of the included pi.py example and submit 
 it using spark-submit --master yarn-client, I get this stack trace:
 {code}
 Traceback (most recent call last):
   File /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py, line 
 39, in module
 count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
   File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 759, in reduce
 vals = self.mapPartitions(func).collect()
   File /home/evan/pub_src/spark/python/pyspark/rdd.py, line 723, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File 
 /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
  line 538, in __call__
   File 
 /home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, 
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 
 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
 org.apache.spark.api.python.PythonException (Traceback (most recent call 
 last):
   File 
 /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
  line 75, in main
 command = pickleSer._read_with_length(infile)
   File 
 /yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
  line 150, in _read_with_length
 return self.loads(obj)
 ImportError: No module named algos
 {code}
 The example works fine if I move the statement "from random import random" 
 from the top into the function (def f(_)) defined in the example.  As near 
 as I can tell, random is getting confused with a function of the same name 
 within pandas.algos.
 Submitting the same script using --master local works, but it prints a 
 distressing amount of random characters to stdout or stderr and messes up my 
 terminal:
 {code}
 ...
 @J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
 @J!@J@J#@J$@J%@J@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J@J=@J@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ
AJ
 AJ
   AJ
 AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ 
 AJ!AJAJ#AJ$AJ%AJAJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ=AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23
  15:42:09 INFO SparkContext: Job finished: reduce at 
 /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
 11.276879779 s
 J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ
  BJ
 BJ
  

[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action

2014-09-24 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3466:
--
Description: Right now, operations like {{collect()}} and {{take()}} can 
crash the driver with an OOM if they bring back too much data. We should add a 
{{spark.driver.maxResultSize}} setting (or something like that) that will make 
the driver abort a job if its result is too big. We can set it to some fraction 
of the driver's memory by default, or to something like 100 MB.  (was: Right 
now, operations like collect() and take() can crash the driver if they bring 
back too much data. We should add a spark.driver.maxResultSize setting (or 
something like that) that will make the driver abort a job if its result is too 
big. We can set it to some fraction of the driver's memory by default, or to 
something like 100 MB.)

 Limit size of results that a driver collects for each action
 

 Key: SPARK-3466
 URL: https://issues.apache.org/jira/browse/SPARK-3466
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia

 Right now, operations like {{collect()}} and {{take()}} can crash the driver 
 with an OOM if they bring back too much data. We should add a 
 {{spark.driver.maxResultSize}} setting (or something like that) that will 
 make the driver abort a job if its result is too big. We can set it to some 
 fraction of the driver's memory by default, or to something like 100 MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3679:
-

 Summary: pickle the exact globals of functions
 Key: SPARK-3679
 URL: https://issues.apache.org/jira/browse/SPARK-3679
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Priority: Critical


function.func_code.co_names has all the names used in the function, including 
the names of attributes. It will pickle some unnecessary globals if there is a 
global that has the same name as an attribute (in co_names).

This is a regression introduced by PR 2144: 
https://github.com/apache/spark/pull/2144/files
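A minimal standalone illustration of the collision (hypothetical names; __code__ is func_code in Python 2):

{code}
# Illustration with hypothetical names: co_names contains attribute names as well as
# globals, so a global that shares a name with an attribute looks, by name alone,
# like something the function needs and would get pickled unnecessarily.
algos = object()            # module-level global, same name as the attribute below

def f(row):
    return row.algos        # attribute access puts 'algos' into f.__code__.co_names

print(f.__code__.co_names)  # ('algos',) -- f.func_code.co_names in Python 2
{code}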





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146691#comment-14146691
 ] 

Apache Spark commented on SPARK-3679:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2522

 pickle the exact globals of functions
 -

 Key: SPARK-3679
 URL: https://issues.apache.org/jira/browse/SPARK-3679
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 function.func_code.co_names has all the names used in the function, including 
 the names of attributes. It will pickle some unnecessary globals if there is a 
 global that has the same name as an attribute (in co_names).
 This is a regression introduced by PR 2144: 
 https://github.com/apache/spark/pull/2144/files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3659) Set EC2 version to 1.1.0 in master branch

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3659.

   Resolution: Fixed
Fix Version/s: 1.2.0

https://github.com/apache/spark/pull/2510

 Set EC2 version to 1.1.0 in master branch
 -

 Key: SPARK-3659
 URL: https://issues.apache.org/jira/browse/SPARK-3659
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


 Master branch should be in sync with branch-1.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-24 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146714#comment-14146714
 ] 

Nan Zhu commented on SPARK-3628:


https://github.com/apache/spark/pull/2524

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Blocker

 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.
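 A small sketch of the expected semantics, under the assumption that updates from result-stage tasks are applied exactly once:
 {code}
 # Sketch of the expected semantics (assumption: result-stage updates are applied
 # exactly once per task, even on re-runs): the accumulator is only touched inside
 # an action, so its final value should be deterministic.
 from pyspark import SparkContext

 sc = SparkContext(appName="ResultStageAccumulator")
 acc = sc.accumulator(0)

 def count_element(_):
     acc.add(1)              # update happens in a result stage (foreach is an action)

 sc.parallelize(range(100), 4).foreach(count_element)
 print(acc.value)            # expected to be exactly 100
 {code}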



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146718#comment-14146718
 ] 

Apache Spark commented on SPARK-3628:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/2524

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker

 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables

2014-09-24 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3680:
---

 Summary: java.lang.Exception: makeCopy when using HiveGeneric UDFs 
on Converted Parquet Metastore tables
 Key: SPARK-3680
 URL: https://issues.apache.org/jira/browse/SPARK-3680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146738#comment-14146738
 ] 

Apache Spark commented on SPARK-3680:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2525

 java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted 
 Parquet Metastore tables
 ---

 Key: SPARK-3680
 URL: https://issues.apache.org/jira/browse/SPARK-3680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules

2014-09-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3634.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2492
[https://github.com/apache/spark/pull/2492]

 Python modules added through addPyFile should take precedence over system 
 modules
 -

 Key: SPARK-3634
 URL: https://issues.apache.org/jira/browse/SPARK-3634
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.2, 1.1.0
Reporter: Josh Rosen
 Fix For: 1.2.0


 Python modules added through {{SparkContext.addPyFile()}} are currently 
 _appended_ to {{sys.path}}; this is probably the opposite of the behavior 
 that we want, since it causes system versions of modules to take precedence 
 over versions explicitly added by users.
 To fix this, we should change the {{sys.path}} manipulation code in 
 {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}.
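 A minimal sketch of the described change (not the actual context.py/worker.py code):
 {code}
 # Sketch only: prepend user-added paths so they shadow system-installed versions
 # of the same module, instead of appending them to the end of sys.path.
 import sys

 def add_user_path(path):
     if path not in sys.path:
         sys.path.insert(1, path)   # index 1 keeps the running script's directory first
 {code}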



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146804#comment-14146804
 ] 

Andrew Ash commented on SPARK-889:
--

[~matei] should we close this ticket as Won't Fix then, since effort is better 
spent making TorrentBroadcast better?

 Bring back DFS broadcast
 

 Key: SPARK-889
 URL: https://issues.apache.org/jira/browse/SPARK-889
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Minor

 DFS broadcast was a simple way to get better-than-single-master performance 
 for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3679.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2522
[https://github.com/apache/spark/pull/2522]

 pickle the exact globals of functions
 -

 Key: SPARK-3679
 URL: https://issues.apache.org/jira/browse/SPARK-3679
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.2.0


 function.func_code.co_names has all the names used in the function, including 
 the names of attributes. It will pickle some unnecessary globals if there is a 
 global that has the same name as an attribute (in co_names).
 There is a regression introduced by PR 2114 
 https://github.com/apache/spark/pull/2144/files
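 A small illustration (names are made up) of how co_names over-captures: it lists
 attribute names alongside referenced globals, so a global that merely shares a
 name with an attribute would get pickled too.
{code}
files = "an unrelated global"          # global sharing a name with an attribute

def count(record):
    return len(record.files)           # only the *attribute* 'files' is used

# co_names contains attribute names as well as referenced globals, so a filter
# based on it alone would decide the global 'files' must ship with 'count'.
print(count.__code__.co_names)         # ('len', 'files')
# (On Python 2, the ticket's spelling is count.func_code.co_names.)
{code}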



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3681:
-

 Summary: Failed to serialize ArrayType or MapType after 
accessing them in Python
 Key: SPARK-3681
 URL: https://issues.apache.org/jira/browse/SPARK-3681
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
files_schema_rdd.map(lambda x: x.files).take(1)
{code}

It will also lose the schema after iterating over an ArrayType.

{code}
files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146903#comment-14146903
 ] 

Apache Spark commented on SPARK-3681:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2526

 Failed to serialize ArrayType or MapType after accessing them in Python
 -

 Key: SPARK-3681
 URL: https://issues.apache.org/jira/browse/SPARK-3681
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu

 {code}
 files_schema_rdd.map(lambda x: x.files).take(1)
 {code}
 It will also lose the schema after iterating over an ArrayType.
 {code}
 files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3682) Add helpful warnings to the UI

2014-09-24 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3682:
-

 Summary: Add helpful warnings to the UI
 Key: SPARK-3682
 URL: https://issues.apache.org/jira/browse/SPARK-3682
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Sandy Ryza


Spark has a zillion configuration options and a zillion different things that 
can go wrong with a job.  Improvements like incremental and better metrics and 
the proposed spark replay debugger provide more insight into what's going on 
under the covers.  However, it's difficult for non-advanced users to synthesize 
this information and understand where to direct their attention. It would be 
helpful to have some sort of central location on the UI users could go to that 
would provide indications about why an app/job is failing or performing poorly.

Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young 
generation.
* Warn that tasks in a particular stage are very short, and that the number of 
partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number 
of partitions should probably be decreased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
lot of time is being spent recomputing it.

To start, probably two kinds of warnings would be most helpful.
* Warnings at the app level that report on misconfigurations, issues with the 
general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2131) Collect per-task filesystem-bytes-read/written metrics

2014-09-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-2131.
---
Resolution: Duplicate

 Collect per-task filesystem-bytes-read/written metrics
 --

 Key: SPARK-2131
 URL: https://issues.apache.org/jira/browse/SPARK-2131
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI

2014-09-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3682:
-
 Target Version/s: 1.2.0
Affects Version/s: 1.1.0

 Add helpful warnings to the UI
 --

 Key: SPARK-3682
 URL: https://issues.apache.org/jira/browse/SPARK-3682
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Sandy Ryza

 Spark has a zillion configuration options and a zillion different things that 
 can go wrong with a job.  Improvements like incremental and better metrics 
 and the proposed spark replay debugger provide more insight into what's going 
 on under the covers.  However, it's difficult for non-advanced users to 
 synthesize this information and understand where to direct their attention. 
 It would be helpful to have some sort of central location on the UI users 
 could go to that would provide indications about why an app/job is failing or 
 performing poorly.
 Some helpful messages that we could provide:
 * Warn that the tasks in a particular stage are spending a long time in GC.
 * Warn that spark.shuffle.memoryFraction does not fit inside the young 
 generation.
 * Warn that tasks in a particular stage are very short, and that the number 
 of partitions should probably be decreased.
 * Warn that tasks in a particular stage are spilling a lot, and that the 
 number of partitions should probably be decreased.
 * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
 lot of time is being spent recomputing it.
 To start, probably two kinds of warnings would be most helpful.
 * Warnings at the app level that report on misconfigurations, issues with the 
 general health of executors.
 * Warnings at the job level that indicate why a job might be performing 
 slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147032#comment-14147032
 ] 

Josh Rosen commented on SPARK-889:
--

In fact, I think [~rxin] has some JIRAs and PRs to make TorrentBroadcast _even_ 
better than it is now (it was greatly improved from 1.0.2 to 1.1.0), so it's 
probably safe to close this.

 Bring back DFS broadcast
 

 Key: SPARK-889
 URL: https://issues.apache.org/jira/browse/SPARK-889
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Minor

 DFS broadcast was a simple way to get better-than-single-master performance 
 for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147042#comment-14147042
 ] 

Josh Rosen commented on SPARK-3639:
---

This sounds reasonable to me; feel free to open a PR.  If you look at most of 
the other Spark examples, they only set the appName when creating the 
SparkContext and leave the master unspecified in order to allow it to be set 
when passing the script to {{spark-submit}}.
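For reference, the convention described above, sketched in PySpark (the app name
is illustrative):
{code}
from pyspark import SparkConf, SparkContext

# Set only the application name; leave the master unset so it can be supplied
# at launch time via `spark-submit --master ...` (local[*], a cluster URL, etc.).
conf = SparkConf().setAppName("KinesisExample")
sc = SparkContext(conf=conf)
{code}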

 Kinesis examples set master as local
 

 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor
  Labels: examples

 Kinesis examples set master as local, thus not allowing the example to be 
 tested on a cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-889.
---
Resolution: Won't Fix

 Bring back DFS broadcast
 

 Key: SPARK-889
 URL: https://issues.apache.org/jira/browse/SPARK-889
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Minor

 DFS broadcast was a simple way to get better-than-single-master performance 
 for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2691:
---
Assignee: Timothy Chen  (was: Tim Chen)

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: mesos

 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is then downloaded on each slave (or 
 even on each execution, depending on whether coarse-grained mode is used).
 We want to make Spark able to support launching executors via a Docker image, 
 building on the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and the 
 options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode

2014-09-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3678:
-
Affects Version/s: (was: 1.2.0)
   1.1.0

 Yarn app name reported in RM is different between cluster and client mode
 -

 Key: SPARK-3678
 URL: https://issues.apache.org/jira/browse/SPARK-3678
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 If you launch an application in yarn cluster mode the name of the application 
 in the ResourceManager generally shows up as the full name 
 org.apache.spark.examples.SparkHdfsLR.  If you start the same app in client 
 mode it shows up as SparkHdfsLR.
 We should be consistent between them.  
 I haven't looked at it in detail; perhaps it's only the examples, but I think 
 I've seen this with customer apps also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2691:
---
Assignee: Tim Chen  (was: Timothy Hunter)

 Allow Spark on Mesos to be launched with Docker
 ---

 Key: SPARK-2691
 URL: https://issues.apache.org/jira/browse/SPARK-2691
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen
  Labels: mesos

 Currently, to launch Spark with Mesos one must upload a tarball and specify 
 the executor URI to be passed in, which is then downloaded on each slave (or 
 even on each execution, depending on whether coarse-grained mode is used).
 We want to make Spark able to support launching executors via a Docker image, 
 building on the recent Docker and Mesos integration work. 
 With the recent integration, Spark can simply specify a Docker image and the 
 options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3684) Can't configure local dirs in Yarn mode

2014-09-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3684:


 Summary: Can't configure local dirs in Yarn mode
 Key: SPARK-3684
 URL: https://issues.apache.org/jira/browse/SPARK-3684
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or


We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked up 
in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either 
because these are overridden by Yarn.

I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the default 
behavior is for Spark to use Yarn's local dirs, but right now there's no way to 
change it even if the user wants to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3604.

Resolution: Not a Problem

 unbounded recursion in getNumPartitions triggers stack overflow for large 
 UnionRDD
 --

 Key: SPARK-3604
 URL: https://issues.apache.org/jira/browse/SPARK-3604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: linux.  Used python, but error is in Scala land.
Reporter: Eric Friedman
Priority: Critical

 I have a large number of parquet files all with the same schema and attempted 
 to make a UnionRDD out of them.
 When I call getNumPartitions(), I get a stack overflow error
 that looks like this:
 Py4JJavaError: An error occurred while calling o3275.partitions.
 : java.lang.StackOverflowError
   at 
 scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

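Not discussed in this thread, but one way to sidestep the deep nesting is to build
a single flat union rather than chaining union(); a PySpark sketch with made-up
inputs:
{code}
from functools import reduce
from pyspark import SparkContext

sc = SparkContext("local", "union-sketch")
rdds = [sc.parallelize([i]) for i in range(1000)]

# Chaining union() nests one UnionRDD inside another, so computing the
# partition list recurses once per input and can overflow the stack:
#     nested = reduce(lambda a, b: a.union(b), rdds)

# SparkContext.union() builds one flat UnionRDD over all inputs instead:
flat = sc.union(rdds)
print(flat.getNumPartitions())
{code}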


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3681:
---
Component/s: PySpark

 Failed to serialize ArrayType or MapType after accessing them in Python
 -

 Key: SPARK-3681
 URL: https://issues.apache.org/jira/browse/SPARK-3681
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

 {code}
 files_schema_rdd.map(lambda x: x.files).take(1)
 {code}
 It will also lose the schema after iterating over an ArrayType.
 {code}
 files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3663:
---
Component/s: Documentation

 Document SPARK_LOG_DIR and SPARK_PID_DIR
 

 Key: SPARK-3663
 URL: https://issues.apache.org/jira/browse/SPARK-3663
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Andrew Ash
Assignee: Andrew Ash

 I'm using these two parameters in some puppet scripts for standalone 
 deployment and realized that they're not documented anywhere. We should 
 document them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Component/s: Spark Core

 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3615) Kafka test should not hard code Zookeeper port

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3615.

Resolution: Fixed

https://github.com/apache/spark/pull/2483

 Kafka test should not hard code Zookeeper port
 --

 Key: SPARK-3615
 URL: https://issues.apache.org/jira/browse/SPARK-3615
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Saisai Shao
Priority: Blocker

 This is causing failures in our master build if port 2181 is contended. 
 Instead of binding to a static port, we should refactor this so that it 
 opens a socket on port 0 and then reads back the port; that way we can never 
 have contention.
 {code}
 sbt.ForkMain$ForkError: Address already in use
   at sun.nio.ch.Net.bind0(Native Method)
   at sun.nio.ch.Net.bind(Net.java:444)
   at sun.nio.ch.Net.bind(Net.java:436)
   at 
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
   at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
   at 
 org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.init(KafkaStreamSuite.scala:200)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62)
   at 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
   at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
   at org.junit.runners.Suite.runChild(Suite.java:128)
   at org.junit.runners.Suite.runChild(Suite.java:24)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
   at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
   at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
   at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90)
   at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223)
   at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236)
   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
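The bind-to-port-0 idea, sketched in Python; in the real fix the embedded
Zookeeper server itself would bind to port 0 and report its port, so nothing else
can grab it in between.
{code}
import socket

# Bind to port 0 so the OS picks any free port, then read the chosen port back.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 0))
port = s.getsockname()[1]
print("free port:", port)
s.close()
{code}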



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3685) Spark's local dir scheme is not configurable

2014-09-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3685:


 Summary: Spark's local dir scheme is not configurable
 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or


When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it will 
try to do is create a folder called "hdfs:" and put "tmp" inside it. This is 
because in Util#getOrCreateLocalRootDirs we use java.io.File instead of 
Hadoop's file system to parse this path. We also need to resolve the path 
appropriately.

This may not have an urgent use case, but it fails silently and does 
something the user would not expect.
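A rough Python analogy of the parsing problem (java.io.File treats the whole
string as a local path, while a URI-aware parse separates the scheme from the
path):
{code}
import os.path
from urllib.parse import urlparse

raw = "hdfs:/tmp/foo"

# Treated as a plain local path, "hdfs:" becomes just another directory name:
print(os.path.join("/data/spark", raw))   # /data/spark/hdfs:/tmp/foo

# Parsed as a URI, the scheme and path come apart, which is what a
# Hadoop-FileSystem-based resolution would key off:
u = urlparse(raw)
print(u.scheme, u.path)                   # hdfs /tmp/foo
{code}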



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3476) Yarn ClientBase.validateArgs memory checks wrong

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147180#comment-14147180
 ] 

Apache Spark commented on SPARK-3476:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2528

 Yarn ClientBase.validateArgs memory checks wrong
 

 Key: SPARK-3476
 URL: https://issues.apache.org/jira/browse/SPARK-3476
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 The YARN ClientBase.validateArgs memory checks are no longer valid. It used 
 to be that the overhead was taken out of what the user specified; now we add 
 it on top of what the user specifies. We can probably just remove these checks: 
 {code}
 (args.amMemory <= memoryOverhead) -> ("Error: AM memory size must be " +
   "greater than: " + memoryOverhead),
 (args.executorMemory <= memoryOverhead) -> ("Error: Executor memory size " +
   "must be greater than: " + memoryOverhead.toString)
 {code}
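 A toy restatement of the two semantics (the numbers are made up):
{code}
requested_mb = 1024   # what the user asks for, e.g. --executor-memory
overhead_mb = 384     # YARN memory overhead

# Old semantics: overhead came out of the request, so a request at or below
# the overhead left no heap at all -- hence the "must be greater than" checks.
old_heap_mb = requested_mb - overhead_mb          # 640

# New semantics: overhead is added on top of the request, so the old checks no
# longer guard against anything.
new_container_mb = requested_mb + overhead_mb     # 1408

print(old_heap_mb, new_container_mb)
{code}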



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3686:
--

 Summary: flume.SparkSinkSuite.Success is flaky
 Key: SPARK-3686
 URL: https://issues.apache.org/jira/browse/SPARK-3686
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Hari Shreedharan
Priority: Blocker


{code}
Error Message

4000 did not equal 5000
Stacktrace

sbt.ForkMain$ForkError: 4000 did not equal 5000
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
at org.scalatest.Suite$class.run(Suite.scala:1423)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
at org.scalatest.FunSuite.run(FunSuite.scala:1559)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Example test result (this will stop working in a few days):
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir scheme is not configurable

2014-09-24 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147206#comment-14147206
 ] 

Andrew Or commented on SPARK-3685:
--

Note that this is not meaningful unless we also change the usages of this to 
use the Hadoop FileSystem. This requires a non-trivial refactor of shuffle and 
spill code to use the Hadoop API.

 Spark's local dir scheme is not configurable
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
 will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does 
 something the user would not expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3412) Add Missing Types for Row API

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147261#comment-14147261
 ] 

Apache Spark commented on SPARK-3412:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2529

 Add Missing Types for Row API
 -

 Key: SPARK-3412
 URL: https://issues.apache.org/jira/browse/SPARK-3412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147274#comment-14147274
 ] 

Kousuke Saruta commented on SPARK-3610:
---

Hi [~skrishna...@gmail.com], I'm trying to resolve a similar issue and I think I 
can resolve this one using the Application ID.
See https://github.com/apache/spark/pull/2432


 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147274#comment-14147274
 ] 

Kousuke Saruta edited comment on SPARK-3610 at 9/25/14 2:35 AM:


Hi [~SK], I'm trying to resolve a similar issue and I think I can resolve this 
one using the Application ID.
See https://github.com/apache/spark/pull/2432



was (Author: sarutak):
Hi [~skrishna...@gmail.com], I'm trying to resolve similar issue and I think I 
can resolve this issue using Application ID.
See https://github.com/apache/spark/pull/2432


 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3665) Java API for GraphX

2014-09-24 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3665:
--
Description: 
The Java API will wrap the Scala API in a similar manner as JavaRDD. Components 
will include:
# JavaGraph
#- removes optional param from persist, subgraph, mapReduceTriplets, 
Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
#- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
#- merges multiple parameters lists
#- incorporates GraphOps
# JavaVertexRDD
# JavaEdgeRDD
# JavaGraphLoader
#- removes optional params, or uses builder pattern

  was:
The Java API will wrap the Scala API in a similar manner as JavaRDD. Components 
will include:
1. JavaGraph
-- removes optional param from persist, subgraph, mapReduceTriplets, 
Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
-- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
-- merges multiple parameters lists
-- incorporates GraphOps
2. JavaVertexRDD
3. JavaEdgeRDD
4. JavaGraphLoader
-- removes optional params, or uses builder pattern


 Java API for GraphX
 ---

 Key: SPARK-3665
 URL: https://issues.apache.org/jira/browse/SPARK-3665
 Project: Spark
  Issue Type: Improvement
  Components: GraphX, Java API
Reporter: Ankur Dave
Assignee: Ankur Dave

 The Java API will wrap the Scala API in a similar manner as JavaRDD. 
 Components will include:
 # JavaGraph
 #- removes optional param from persist, subgraph, mapReduceTriplets, 
 Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
 #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
 #- merges multiple parameters lists
 #- incorporates GraphOps
 # JavaVertexRDD
 # JavaEdgeRDD
 # JavaGraphLoader
 #- removes optional params, or uses builder pattern



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147280#comment-14147280
 ] 

Apache Spark commented on SPARK-3666:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/2530

 Extract interfaces for EdgeRDD and VertexRDD
 

 Key: SPARK-3666
 URL: https://issues.apache.org/jira/browse/SPARK-3666
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147314#comment-14147314
 ] 

Hari Shreedharan commented on SPARK-3686:
-

Looking into this.

 flume.SparkSinkSuite.Success is flaky
 -

 Key: SPARK-3686
 URL: https://issues.apache.org/jira/browse/SPARK-3686
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Hari Shreedharan
Priority: Blocker

 {code}
 Error Message
 4000 did not equal 5000
 Stacktrace
 sbt.ForkMain$ForkError: 4000 did not equal 5000
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
   at org.scalatest.Suite$class.run(Suite.scala:1423)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Example test result (this will stop working in a few days):
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147316#comment-14147316
 ] 

Hari Shreedharan commented on SPARK-3686:
-

Unlike the other tests in this suite, this one does not have a sleep to let the 
sink commit the transactions back to the channel, so the channel does not get 
enough time to actually become empty. Let me add a sleep - I will send a PR and 
run the pre-commit hook a bunch of times to ensure that it fixes it.

 flume.SparkSinkSuite.Success is flaky
 -

 Key: SPARK-3686
 URL: https://issues.apache.org/jira/browse/SPARK-3686
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Hari Shreedharan
Priority: Blocker

 {code}
 Error Message
 4000 did not equal 5000
 Stacktrace
 sbt.ForkMain$ForkError: 4000 did not equal 5000
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
   at org.scalatest.Suite$class.run(Suite.scala:1423)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Example test result (this will stop working in a few days):
 

[jira] [Resolved] (SPARK-546) Support full outer join and multiple join in a single shuffle

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-546.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Aaron Staple

Fixed by:
https://github.com/apache/spark/pull/1395

 Support full outer join and multiple join in a single shuffle
 -

 Key: SPARK-546
 URL: https://issues.apache.org/jira/browse/SPARK-546
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Streaming
Reporter: Reynold Xin
Assignee: Aaron Staple
 Fix For: 1.2.0


 RDD[(K,V)] now supports left/right outer join but not full outer join.
 Also it'd be nice to provide a way for users to join multiple RDDs on the 
 same key in a single shuffle.
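 Until a built-in fullOuterJoin lands, a single cogroup (one shuffle) can
 emulate it; a PySpark sketch with made-up data:
{code}
from pyspark import SparkContext

sc = SparkContext("local", "full-outer-sketch")
a = sc.parallelize([("k1", 1), ("k2", 2)])
b = sc.parallelize([("k2", 20), ("k3", 30)])

# cogroup shuffles both RDDs once and groups values by key; padding the empty
# side with None gives full-outer-join semantics.
full = a.cogroup(b).flatMap(
    lambda kv: [(kv[0], (x, y))
                for x in (list(kv[1][0]) or [None])
                for y in (list(kv[1][1]) or [None])])

print(sorted(full.collect()))
# [('k1', (1, None)), ('k2', (2, 20)), ('k3', (None, 30))]
{code}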



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147331#comment-14147331
 ] 

Apache Spark commented on SPARK-3686:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2531

 flume.SparkSinkSuite.Success is flaky
 -

 Key: SPARK-3686
 URL: https://issues.apache.org/jira/browse/SPARK-3686
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Hari Shreedharan
Priority: Blocker

 {code}
 Error Message
 4000 did not equal 5000
 Stacktrace
 sbt.ForkMain$ForkError: 4000 did not equal 5000
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
   at org.scalatest.Suite$class.run(Suite.scala:1423)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Example test result (this will stop working in a few days):
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/
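A likely reason for this kind of flakiness is that the assertion races the 
asynchronous drain of the channel, so the check can run while some events are 
still in flight (here 4000 did not equal 5000). The sketch below is not the 
actual SparkSinkSuite code; it is a minimal ScalaTest illustration (class name, 
counter, and timeouts are assumptions) of the pattern and the usual fix: polling 
with Eventually instead of asserting once.

{code}
import java.util.concurrent.atomic.AtomicInteger

import scala.concurrent.{ExecutionContext, Future}

import org.scalatest.FunSuite
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Minimal sketch: a background "drain" increments a counter, standing in for
// events leaving the channel. A one-shot assert can observe a partial count;
// eventually() polls until the drain has finished or the timeout expires.
class FlakyDrainSuite extends FunSuite {
  test("drained count eventually reaches the expected total") {
    val processed = new AtomicInteger(0)
    implicit val ec = ExecutionContext.global
    Future { (1 to 5000).foreach(_ => processed.incrementAndGet()) }

    // assert(processed.get() == 5000)  // racy: may run before the drain completes
    eventually(timeout(10.seconds), interval(100.millis)) {
      assert(processed.get() == 5000)
    }
  }
}
{code}

An alternative is to wait on an explicit completion signal (a latch or future) 
from the producer before asserting on the channel contents.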



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-3687) Spark hang while

2014-09-24 Thread Ziv Huang (JIRA)
Ziv Huang created SPARK-3687:


 Summary: Spark hang while 
 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
Reporter: Ziv Huang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Summary: Spark hang while processing more than 100 sequence files  (was: 
Spark hang while )

 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
Reporter: Ziv Huang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Affects Version/s: 1.0.2
   1.1.0

 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Component/s: Spark Core

 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: I use spark 

 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang

 I use spark 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: In my application, I read more than 100 sequence files into a 
JavaPairRDD, perform a flatMap to get another JavaRDD, and then use takeOrdered  
(was: In my application, I read more than 100 sequence files, )

 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang

 In my application, I read more than 100 sequence files into a JavaPairRDD, 
 perform a flatMap to get another JavaRDD, and then use takeOrdered



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: 
In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
result.
Quite often (but not always), Spark hangs while executing some of the 
110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the Spark job hangs, I can't find any error message anywhere, and I can't 
kill the job from the web UI.

The current workaround is to use coalesce to reduce the number of partitions to 
be processed.
I never see the job hang if the number of partitions to be processed is no 
greater than 80.

  was:In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered


 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang

 In my application, I read more than 100 sequence files into a JavaPairRDD, 
 perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
 result.
 Quite often (but not always), Spark hangs while executing some of the 
 110th-130th tasks.
 The job can hang for several hours, maybe forever (I can't wait for its 
 completion).
 When the Spark job hangs, I can't find any error message anywhere, and I can't 
 kill the job from the web UI.
 The current workaround is to use coalesce to reduce the number of partitions 
 to be processed.
 I never see the job hang if the number of partitions to be processed is no 
 greater than 80.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: 
In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
result.
Quite often (but not always), Spark hangs while executing some of the 
110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the Spark job hangs, I can't find any error message anywhere, and I can't 
kill the job from the web UI.

The current workaround is to use coalesce to reduce the number of partitions to 
be processed.
The job never hangs if the number of partitions to be processed is no 
greater than 80.

  was:
In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
result.
Quite often (but not always), Spark hangs while executing some of the 
110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the Spark job hangs, I can't find any error message anywhere, and I can't 
kill the job from the web UI.

The current workaround is to use coalesce to reduce the number of partitions to 
be processed.
I never see the job hang if the number of partitions to be processed is no 
greater than 80.


 Spark hang while processing more than 100 sequence files
 

 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Ziv Huang

 In my application, I read more than 100 sequence files into a JavaPairRDD, 
 perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
 result.
 Quite often (but not always), Spark hangs while executing some of the 
 110th-130th tasks.
 The job can hang for several hours, maybe forever (I can't wait for its 
 completion).
 When the Spark job hangs, I can't find any error message anywhere, and I can't 
 kill the job from the web UI.
 The current workaround is to use coalesce to reduce the number of partitions 
 to be processed.
 The job never hangs if the number of partitions to be processed is no 
 greater than 80.
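
For reference, here is a minimal Scala sketch of the reported pipeline and 
workaround (the report uses the Java API; the input path, value type, and split 
logic here are assumptions for illustration). Each of the 100+ sequence files 
contributes at least one partition, and capping the partition count with 
coalesce before the action is the workaround described above.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SequenceFileTakeOrdered {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seqfile-takeordered"))

    // Read 100+ sequence files; each input file yields at least one partition.
    val pairs = sc.sequenceFile[String, String]("hdfs:///data/seqfiles/*")

    // flatMap over the values, as in the report.
    val words = pairs.flatMap { case (_, v) => v.split("\\s+") }

    // Reported workaround: reduce the number of partitions before the action,
    // so that no more than 80 tasks run in the final stage.
    val top = words.coalesce(80).takeOrdered(10)

    top.foreach(println)
    sc.stop()
  }
}
{code}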



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event

2014-09-24 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147378#comment-14147378
 ] 

Nan Zhu commented on SPARK-2647:


Isn't this the expected behaviour, given that we keep the DAGScheduler in single-thread mode?
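
For context on the single-thread point: the sketch below is a toy, 
self-contained event loop (not Spark's actual DAGScheduler code; names and 
timings are made up) showing why a single handler thread serializes job 
submissions. While one JobSubmitted handler is busy computing partitions, every 
later event simply waits in the queue.

{code}
import java.util.concurrent.LinkedBlockingQueue

// Toy single-threaded event loop, in the spirit of the DAGScheduler's event
// processing, to show why one slow JobSubmitted handler delays all the others.
object SingleThreadedEventLoopDemo {
  sealed trait Event
  case class JobSubmitted(jobId: Int, partitionListingMillis: Long) extends Event

  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[Event]()

    val loop = new Thread {
      override def run(): Unit = while (true) {
        queue.take() match {
          case JobSubmitted(id, listingMillis) =>
            // Stand-in for walking the RDD graph and calling getPartitions,
            // which in the reported trace blocks on a remote getBlockLocations.
            Thread.sleep(listingMillis)
            println(s"job $id scheduled after $listingMillis ms of partition listing")
        }
      }
    }
    loop.setDaemon(true)
    loop.start()

    queue.put(JobSubmitted(1, partitionListingMillis = 5000)) // slow input listing
    queue.put(JobSubmitted(2, partitionListingMillis = 10))   // fast, but waits ~5 s

    Thread.sleep(6000) // let both events drain before the demo exits
  }
}
{code}

Whether that serialization is acceptable, or whether the slow partition 
computation should be moved off the event-processing thread, is the question 
this issue raises.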

 DAGScheduler plugs others when processing one JobSubmitted event
 

 Key: SPARK-2647
 URL: https://issues.apache.org/jira/browse/SPARK-2647
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: YanTang Zhai

 If a few jobs are submitted, DAGScheduler blocks the others while processing one 
 JobSubmitted event.
 For example, one JobSubmitted event is processed as follows and costs much time:
 spark-akka.actor.default-dispatcher-67 daemon prio=10 
 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:503)
   at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130)
   - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call)
   at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241)
   at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
   at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
   at 
 org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60)
   at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
   at 
 org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472)
   at 
 org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498)
   at 
 org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208)
   at 
 org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204)
   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812)
   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797)
   at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233)
   at 
 StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at