[jira] [Created] (SPARK-3675) Allow starting JDBC server on an existing context
Michael Armbrust created SPARK-3675:
---------------------------------------

             Summary: Allow starting JDBC server on an existing context
                 Key: SPARK-3675
                 URL: https://issues.apache.org/jira/browse/SPARK-3675
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Michael Armbrust
            Assignee: Michael Armbrust

A common question on the mailing list is how to read from temporary tables over JDBC. While we should try to support most of this in SQL, it would also be nice to query generic RDDs over JDBC.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3675) Allow starting JDBC server on an existing context
[ https://issues.apache.org/jira/browse/SPARK-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145969#comment-14145969 ]

Apache Spark commented on SPARK-3675:
-------------------------------------

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2515
[jira] [Updated] (SPARK-3675) Allow starting JDBC server on an existing context
[ https://issues.apache.org/jira/browse/SPARK-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-3675:
------------------------------------
    Target Version/s: 1.2.0
[jira] [Created] (SPARK-3676) jdk version lead to spark hql test suite error
wangfei created SPARK-3676:
--------------------------------

             Summary: jdk version lead to spark hql test suite error
                 Key: SPARK-3676
                 URL: https://issues.apache.org/jira/browse/SPARK-3676
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: wangfei
             Fix For: 1.2.0

System.out.println(1/500d) gives different results on different JDK versions:

jdk 1.6.0_31: 0.0020
jdk 1.7.0_05: 0.002

This makes the Spark SQL Hive test suite fail (reproduce by setting jdk version = 1.6.0_31):

[info] - division *** FAILED ***
[info]   Results do not match for division:
[info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]    Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 / COUNT(1)) AS c_3#695]
[info]     UnresolvedRelation None, src, None
[info]
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]    Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), DoubleType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info]     MetastoreRelation default, src, None
[info]
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]    Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info]     Project []
[info]      MetastoreRelation default, src, None
[info]
[info]   == Physical Plan ==
[info]   Limit 1
[info]    Aggregate false, [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), DoubleType)) AS c_3#695]
[info]     Exchange SinglePartition
[info]      Aggregate true, [], [COUNT(1) AS PartialCount#699L]
[info]       HiveTableScan [], (MetastoreRelation default, src, None), None
[info]
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0   c_1   c_2   c_3
[info]   !== HIVE - 1 row(s) ==          == CATALYST - 1 row(s) ==
[info]   !2.0   0.5   0.   0.002        2.0   0.5   0.   0.0020 (HiveComparisonTest.scala:370)

[info] - timestamp cast #1 *** FAILED ***
[info]   Results do not match for timestamp cast #1:
[info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]    Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info]     UnresolvedRelation None, src, None
[info]
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]    Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info]     MetastoreRelation default, src, None
[info]
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]    Project [0.0010 AS c_0#995]
[info]     MetastoreRelation default, src, None
[info]
[info]   == Physical Plan ==
[info]   Limit 1
[info]    Project [0.0010 AS c_0#995]
[info]     HiveTableScan [], (MetastoreRelation default, src, None), None
[info]
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0
[info]   !== HIVE - 1 row(s) ==          == CATALYST - 1 row(s) ==
[info]   !0.001                          0.0010 (HiveComparisonTest.scala:370)
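The discrepancy is purely about how a double such as 1/500 is rendered as text: Java 6's Double.toString could emit a spurious trailing zero ("0.0020"), while Java 7 emits the shortest form that round-trips ("0.002"). As a stand-in illustration (not Spark code), Python 3's repr() follows the same shortest-round-trip rule, so it reproduces the Java 7 behavior the test expects:

```python
# Illustration only: Python 3's repr() of a float, like Java 7's
# Double.toString, produces the shortest decimal string that parses
# back to exactly the same double.
value = 1 / 500
text = repr(value)
print(text)  # shortest form: 0.002, not 0.0020
assert float(text) == value  # the short form still round-trips exactly
```

A test suite that compares such strings against golden output (as HiveComparisonTest does) is therefore sensitive to which JDK produced the golden files.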
[jira] [Updated] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangfei updated SPARK-3676:
---------------------------
    Summary: jdk version lead to spark sql test suite error  (was: jdk version lead to spark hql test suite error)
[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146011#comment-14146011 ]

Apache Spark commented on SPARK-3620:
-------------------------------------

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/2516

Refactor config option handling code for spark-submit
-----------------------------------------------------

                 Key: SPARK-3620
                 URL: https://issues.apache.org/jira/browse/SPARK-3620
             Project: Spark
          Issue Type: Improvement
          Components: Deploy
    Affects Versions: 1.0.0, 1.1.0
            Reporter: Dale Richardson
            Assignee: Dale Richardson
            Priority: Minor

I'm proposing it's time to refactor the configuration argument handling code in spark-submit. The code has grown organically in a short period of time, handles a pretty complicated logic flow, and is now pretty fragile. Some issues that have been identified:

1. Hand-crafted property file readers that do not support the property file format as specified in http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
2. ResolveURI not called on paths read from conf/prop files
3. Inconsistent means of merging/overriding values from different sources (some get overridden by file, others by manually setting a field on an object, some by properties)
4. Argument validation should be done after combining config files, system properties and command line arguments
5. Alternate conf file location not handled in shell scripts
6. Some options can only be passed as command line arguments
7. Defaults for options are hard-coded (and sometimes overridden multiple times) in many places throughout the code, e.g. master = local[*]

Initial proposal is to use Typesafe Config to read in the config information and merge the various config sources.
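Point 3 in the list above is the crux: each setting should be resolved by a single, fixed precedence rule instead of ad-hoc overrides scattered through the code. A minimal sketch of such a merge, in Python rather than the actual Scala spark-submit code (the key names, defaults, and validation here are illustrative, not Spark's real ones):

```python
from collections import ChainMap

def resolve_config(cli_args, sys_props, conf_file, defaults):
    """Merge config sources with one fixed precedence:
    command line > system properties > conf file > built-in defaults.
    ChainMap returns the value from the first mapping that defines a key.
    Validation runs once, after the merge (point 4 above)."""
    merged = dict(ChainMap(cli_args, sys_props, conf_file, defaults))
    if "spark.master" not in merged:
        raise ValueError("spark.master must be set by some source")
    return merged

conf = resolve_config(
    cli_args={"spark.master": "yarn-client"},
    sys_props={"spark.app.name": "demo"},
    conf_file={"spark.master": "local[4]", "spark.executor.memory": "1g"},
    defaults={"spark.master": "local[*]"},
)
print(conf["spark.master"])  # command line wins over conf file and default
```

With this shape, adding a new source (e.g. an alternate conf file location, point 5) means inserting one mapping into the chain rather than touching every option's handling code.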
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146017#comment-14146017 ]

Sean Owen commented on SPARK-3662:
----------------------------------

Maybe I miss something, but does this just mean you can't import pandas entirely? If you're modifying the example, you should import only what you need from pandas. Or it may be that you need to modify the import of random, indeed, to accommodate other modifications you want to make. But what is the problem with the included example? It runs fine without modifications, no?

Importing pandas breaks included pi.py example
----------------------------------------------

                 Key: SPARK-3662
                 URL: https://issues.apache.org/jira/browse/SPARK-3662
             Project: Spark
          Issue Type: Bug
          Components: PySpark, YARN
    Affects Versions: 1.1.0
         Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04.
            Reporter: Evan Samanas

If I add "import pandas" at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace:

{code}
Traceback (most recent call last):
  File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 39, in <module>
    count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named algos
{code}

The example works fine if I move the statement "from random import random" from the top and into the function (def f(_)) defined in the example. Near as I can tell, "random" is getting confused with a function of the same name within pandas.algos.

Submitting the same script using --master local works, but gives a distressing amount of random characters to stdout or stderr and messes up my terminal:

{code}
[several hundred lines of binary garbage elided]
14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s
Pi is roughly 3.146136
{code}

No idea if that's related, but thought I'd include it.
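The reporter's workaround, moving the import inside the mapped function, works because the name random is then resolved inside the function body when it runs on the worker, instead of being captured from module scope where (per the report) it can be confused with a same-named function in pandas.algos during pickling. A minimal stand-alone sketch of the pattern (plain Python, no Spark or pandas required; the Monte Carlo body mirrors pi.py's):

```python
def f(_):
    # Importing here, inside the function, pins `random` to the stdlib
    # module for each call, so nothing at module scope can shadow it
    # when the function is shipped to and executed on a worker.
    from random import random
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

# Same shape as pi.py's map/reduce step, but as a plain loop.
n = 100_000
count = sum(f(i) for i in range(n))
pi_estimate = 4.0 * count / n
print(pi_estimate)  # roughly 3.14
```

The per-call import is negligible in cost (Python caches modules in sys.modules), which is why this is a common idiom in functions shipped to PySpark workers.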
[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146035#comment-14146035 ]

Apache Spark commented on SPARK-3676:
-------------------------------------

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2517
[jira] [Commented] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146039#comment-14146039 ]

Aaron Davidson commented on SPARK-3267:
---------------------------------------

I don't have it anymore, unfortunately. Michael and I did a little digging at the time, and I think we found the reason for the deadlock, shown in the stack traces above, but decided it was a very unlikely scenario. Indeed, the query did not consistently deadlock; this only occurred a single time.

Deadlock between ScalaReflectionLock and Data type initialization
-----------------------------------------------------------------

                 Key: SPARK-3267
                 URL: https://issues.apache.org/jira/browse/SPARK-3267
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Aaron Davidson
            Priority: Critical

Deadlock here:

{code}
Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
	...
{code}

and

{code}
Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
	- locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast)
	at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
	at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
	at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
	at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
	at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
	at
[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146040#comment-14146040 ]

Sean Owen commented on SPARK-3676:
----------------------------------

(For the interested, I looked it up, since the behavior change sounds surprising. This is in fact a bug in Java 6 that was fixed in Java 7: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022 It may even be fixed in later versions of Java 6, but I have a very recent one and it is not.)
[jira] [Commented] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR
[ https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146041#comment-14146041 ]

Apache Spark commented on SPARK-3663:
-------------------------------------

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2518

Document SPARK_LOG_DIR and SPARK_PID_DIR
----------------------------------------

                 Key: SPARK-3663
                 URL: https://issues.apache.org/jira/browse/SPARK-3663
             Project: Spark
          Issue Type: Documentation
            Reporter: Andrew Ash
            Assignee: Andrew Ash

I'm using these two parameters in some puppet scripts for standalone deployment and realized that they're not documented anywhere. We should document them.
[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error
[ https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146050#comment-14146050 ]

wangfei commented on SPARK-3676:
--------------------------------

hmm, i see, thanks for that.
[jira] [Commented] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146094#comment-14146094 ] Apache Spark commented on SPARK-3526: - User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/2519 Docs section on data locality - Key: SPARK-3526 URL: https://issues.apache.org/jira/browse/SPARK-3526 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.0.2 Reporter: Andrew Ash Assignee: Andrew Ash Several threads on the mailing list have been about data locality and how to interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more details in the docs on this concept so we can point future questions there. A couple people appreciated the below description of locality so it could be a good starting point: {quote} The locality is how close the data is to the code that's processing it. PROCESS_LOCAL means data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so is a little slower because the data has to travel across an IPC connection. RACK_LOCAL is even slower -- data is on a different server so needs to be sent over the network. Spark switches to lower locality levels when there's no unprocessed data on a node that has idle CPUs. In that situation you have two options: wait until the busy CPUs free up so you can start another task that uses data on that server, or start a new task on a farther away server that needs to bring data from that remote place. What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. 
{quote}
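The timeout-and-fallback behavior described in the quote above corresponds to Spark's spark.locality.wait setting (3 seconds by default). A hedged pure-Python sketch, where the function and its signature are illustrative rather than Spark's actual scheduler code:

```python
# Simplified sketch of delay scheduling: a task set schedules at the most
# local level it can, and falls back to the next (less local) level only
# after waiting a configurable number of seconds without a local launch.
LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def next_locality_level(current_level, seconds_since_last_local_launch,
                        wait=3.0):
    """Return the locality level to schedule at after a possible fallback."""
    idx = LOCALITY_LEVELS.index(current_level)
    if seconds_since_last_local_launch >= wait and idx < len(LOCALITY_LEVELS) - 1:
        return LOCALITY_LEVELS[idx + 1]
    return current_level
```

In real deployments the wait can also be tuned per level (e.g. spark.locality.wait.node), which this sketch collapses into a single parameter.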
[jira] [Created] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common
Kousuke Saruta created SPARK-3677: - Summary: Scalastyle is never applied to the sources under yarn/common Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run {{sbt -Pyarn scalastyle}}, scalastyle is not applied to the sources under yarn/common.
[jira] [Commented] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common
[ https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146171#comment-14146171 ] Apache Spark commented on SPARK-3677: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2520 Scalastyle is never applied to the sources under yarn/common Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run {{sbt -Pyarn scalastyle}}, scalastyle is not applied to the sources under yarn/common.
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146245#comment-14146245 ] Matthew Farrellee commented on SPARK-3639: -- seems reasonable to me Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor Labels: examples Kinesis examples set master as local thus not allowing the example to be tested on a cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146246#comment-14146246 ] Ryan D Braley commented on SPARK-2691: -- +1 Spark typically lags behind Mesos in version numbers, so if you run Mesos today you have to choose between Spark and Docker. With this we could have our cake and eat it too :) Allow Spark on Mesos to be launched with Docker --- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Labels: mesos Currently, to launch Spark with Mesos one must upload a tarball and specify an executor URI to be passed in, which is downloaded on each slave (or even on each execution, depending on whether coarse-grained mode is used). We want to make Spark able to support launching executors via a Docker image, building on the recent Docker and Mesos integration work. With that integration, Spark can simply specify a Docker image and the options it needs, and it should continue to work as-is.
[jira] [Created] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode
Thomas Graves created SPARK-3678: Summary: Yarn app name reported in RM is different between cluster and client mode Key: SPARK-3678 URL: https://issues.apache.org/jira/browse/SPARK-3678 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If you launch an application in yarn cluster mode, the name of the application in the ResourceManager generally shows up as the full name org.apache.spark.examples.SparkHdfsLR. If you start the same app in client mode, it shows up as SparkHdfsLR. We should be consistent between them. I haven't looked at it in detail; perhaps it's only the examples, but I think I've seen this with customer apps also.
[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146640#comment-14146640 ] Andrew Ash commented on SPARK-3466: --- How would you design this feature? I can imagine measuring the size of partitions / RDD elements while they are held in memory across the cluster, sending those sizes back to the driver, and having the driver throw an exception if the requested size exceeds the threshold. Otherwise proceed as normal. Is that how you were envisioning the implementation? Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Right now, operations like collect() and take() can crash the driver if they bring back too much data. We should add a spark.driver.maxResultSize setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB.
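The design Andrew sketches above (measure result sizes, send them to the driver, abort past a threshold) can be illustrated with a short pure-Python sketch. The function name and signature here are hypothetical, not Spark's actual API; serialized size stands in for the per-partition sizes that would be reported back to the driver.

```python
# Illustrative driver-side check: sum the serialized size of each partition's
# result and abort the job once the running total exceeds a configured limit.
import pickle

def collect_results(partitions, max_result_size):
    """Hypothetical collect() that enforces a total-result-size limit (bytes)."""
    total, results = 0, []
    for part in partitions:
        total += len(pickle.dumps(part))  # stand-in for the reported size
        if total > max_result_size:
            raise RuntimeError(
                "Total size of results (%d bytes) exceeds limit (%d bytes)"
                % (total, max_result_size))
        results.append(part)
    return results
```

Checking the running total before appending means the driver fails fast instead of accumulating an over-limit result and only then discovering the problem.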
[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example
[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146639#comment-14146639 ] Evan Samanas commented on SPARK-3662: - I wouldn't focus on the example, that I modified it, or whether I should be importing a small portion of pandas. The issue here is that Spark breaks in this case because of a name collision. Modifying the example is simply the one reproducer I've found. I was modifying the example to learn about how Spark ships Python code to the cluster. In this case, I expected pandas to only be imported in the driver program and not to be imported by any workers. The workers do not have pandas installed, so expected behavior means the example would run to completion, and an ImportError would mean that the workers are importing things they don't need for the task at hand. The way I expected Spark to work IS actually how Spark works...modules will only be imported by workers if a function passed to them uses the modules, but this error showed me false evidence to the contrary. I'm assuming the error is in Spark's modifications to CloudPickle...not in the way the example is set up. Importing pandas breaks included pi.py example -- Key: SPARK-3662 URL: https://issues.apache.org/jira/browse/SPARK-3662 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.0 Environment: Xubuntu 14.04. Yarn cluster running on Ubuntu 12.04. 
Reporter: Evan Samanas If I add {{import pandas}} at the top of the included pi.py example and submit using spark-submit --master yarn-client, I get this stack trace:
{code}
Traceback (most recent call last):
  File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 39, in <module>
    count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError
14/09/23 15:51:58 INFO TaskSetManager: Lost task 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named algos
{code}
The example works fine if I move the statement {{from random import random}} from the top and into the function ({{def f(_)}}) defined in the example. Near as I can tell, {{random}} is getting confused with a function of the same name within {{pandas.algos}}. Submitting the same script using --master local works, but gives a distressing amount of random characters to stdout or stderr and messes up my terminal:
{code}
...
@J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J ...
[... binary garbage continues ...]
14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 11.276879779 s
[... binary garbage continues ...]
{code}
[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3466: -- Description: Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. (was: Right now, operations like collect() and take() can crash the driver if they bring back too many data. We should add a spark.driver.maxResultSize setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB.) Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB.
[jira] [Created] (SPARK-3679) pickle the exact globals of functions
Davies Liu created SPARK-3679: - Summary: pickle the exact globals of functions Key: SPARK-3679 URL: https://issues.apache.org/jira/browse/SPARK-3679 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Priority: Critical {{function.func_code.co_names}} has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if there is a global with the same name as an attribute (in co_names). This is a regression introduced by PR 2144: https://github.com/apache/spark/pull/2144/files
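The over-pickling described in this issue can be reproduced in plain Python: co_names contains every name a function references, including attribute names, so a globals filter built from co_names can capture a global that merely shares its name with an attribute. The names `path` and `f` below are illustrative:

```python
# co_names holds globals AND attribute names, so filtering a function's
# globals by co_names over-approximates what the function actually uses.
import os

path = "a module-level global that f never uses"  # collides with os.path

def f():
    return os.path.sep  # uses the *attribute* os.path, not the global `path`

# 'os' is a real global reference, but 'path' and 'sep' appear in co_names
# only as attribute accesses; a naive co_names filter would still pickle
# the unrelated global `path` along with the function.
names = f.__code__.co_names
```

This is why the fix tracks the exact globals a function reads rather than everything listed in co_names.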
[jira] [Commented] (SPARK-3679) pickle the exact globals of functions
[ https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146691#comment-14146691 ] Apache Spark commented on SPARK-3679: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2522 pickle the exact globals of functions - Key: SPARK-3679 URL: https://issues.apache.org/jira/browse/SPARK-3679 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Critical {{function.func_code.co_names}} has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if there is a global with the same name as an attribute (in co_names). This is a regression introduced by PR 2144: https://github.com/apache/spark/pull/2144/files
[jira] [Resolved] (SPARK-3659) Set EC2 version to 1.1.0 in master branch
[ https://issues.apache.org/jira/browse/SPARK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3659. Resolution: Fixed Fix Version/s: 1.2.0 https://github.com/apache/spark/pull/2510 Set EC2 version to 1.1.0 in master branch - Key: SPARK-3659 URL: https://issues.apache.org/jira/browse/SPARK-3659 Project: Spark Issue Type: Bug Components: EC2 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Minor Fix For: 1.2.0 Master branch should be in sync with branch-1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146714#comment-14146714 ] Nan Zhu commented on SPARK-3628: https://github.com/apache/spark/pull/2524 Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Blocker In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146718#comment-14146718 ] Apache Spark commented on SPARK-3628: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/2524 Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables
Michael Armbrust created SPARK-3680: --- Summary: java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables Key: SPARK-3680 URL: https://issues.apache.org/jira/browse/SPARK-3680 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables
[ https://issues.apache.org/jira/browse/SPARK-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146738#comment-14146738 ] Apache Spark commented on SPARK-3680: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2525 java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables --- Key: SPARK-3680 URL: https://issues.apache.org/jira/browse/SPARK-3680 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules
[ https://issues.apache.org/jira/browse/SPARK-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3634. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2492 [https://github.com/apache/spark/pull/2492] Python modules added through addPyFile should take precedence over system modules - Key: SPARK-3634 URL: https://issues.apache.org/jira/browse/SPARK-3634 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 1.1.0 Reporter: Josh Rosen Fix For: 1.2.0 Python modules added through {{SparkContext.addPyFile()}} are currently _appended_ to {{sys.path}}; this is probably the opposite of the behavior that we want, since it causes system versions of modules to take precedence over versions explicitly added by users. To fix this, we should change the {{sys.path}} manipulation code in {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
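The precedence change in SPARK-3634 comes down to `sys.path.insert(0, ...)` versus `sys.path.append(...)`. A minimal, self-contained demonstration (the module name `mymod` and the two directories are made up for the example):

```python
# A path *prepended* to sys.path shadows a same-named module further down the
# list -- the precedence users expect for files they add explicitly.
import os
import sys
import tempfile

def write_module(dirname, name, body):
    """Write a tiny single-file module into dirname."""
    with open(os.path.join(dirname, name + ".py"), "w") as fh:
        fh.write(body)

system_dir = tempfile.mkdtemp()  # stands in for an installed system module
user_dir = tempfile.mkdtemp()    # stands in for a file added via addPyFile

write_module(system_dir, "mymod", "VERSION = 'system'\n")
write_module(user_dir, "mymod", "VERSION = 'user'\n")

sys.path.append(system_dir)   # old behavior: user files appended after this
sys.path.insert(0, user_dir)  # fixed behavior: prepend, so user files win

import mymod  # resolves to the prepended (user) copy
```

With `append` instead of `insert(0, ...)`, the same import would silently pick up the system copy, which is the bug the fix addresses.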
[jira] [Commented] (SPARK-889) Bring back DFS broadcast
[ https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146804#comment-14146804 ] Andrew Ash commented on SPARK-889: -- [~matei] should we close this ticket as Won't Fix then, since effort is better spent making TorrentBroadcast better? Bring back DFS broadcast Key: SPARK-889 URL: https://issues.apache.org/jira/browse/SPARK-889 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Priority: Minor DFS broadcast was a simple way to get better-than-single-master performance for broadcast, so we should add it back for people who have HDFS.
[jira] [Resolved] (SPARK-3679) pickle the exact globals of functions
[ https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3679. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2522 [https://github.com/apache/spark/pull/2522] pickle the exact globals of functions - Key: SPARK-3679 URL: https://issues.apache.org/jira/browse/SPARK-3679 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Fix For: 1.2.0 {{function.func_code.co_names}} has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if there is a global with the same name as an attribute (in co_names). This is a regression introduced by PR 2144: https://github.com/apache/spark/pull/2144/files
[jira] [Created] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python
Davies Liu created SPARK-3681: - Summary: Failed to serialize ArrayType or MapType after accessing them in Python Key: SPARK-3681 URL: https://issues.apache.org/jira/browse/SPARK-3681 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} files_schema_rdd.map(lambda x: x.files).take(1) {code} Also it will lose the schema after iterating over an ArrayType. {code} files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1) {code}
[jira] [Commented] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python
[ https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146903#comment-14146903 ] Apache Spark commented on SPARK-3681: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2526 Failed to serialize ArrayType or MapType after accessing them in Python - Key: SPARK-3681 URL: https://issues.apache.org/jira/browse/SPARK-3681 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu {code} files_schema_rdd.map(lambda x: x.files).take(1) {code} Also it will lose the schema after iterating over an ArrayType. {code} files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1) {code}
[jira] [Created] (SPARK-3682) Add helpful warnings to the UI
Sandy Ryza created SPARK-3682: - Summary: Add helpful warnings to the UI Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed Spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI that users could go to for indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young generation.
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it.
To start, probably two kinds of warnings would be most helpful:
* Warnings at the app level that report on misconfigurations and issues with the general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.
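To make one of the proposed heuristics concrete, the GC warning could be a simple aggregate check over per-task metrics. This is a hypothetical sketch, not any existing Spark API: the function name, the (run_time, gc_time) input shape, and the 10% threshold are all illustrative.

```python
# Hypothetical heuristic for the "long time in GC" warning: flag a stage
# whose tasks spend more than a threshold fraction of their time in GC.
def gc_warning(task_metrics, threshold=0.1):
    """task_metrics: list of (run_time_ms, gc_time_ms) pairs for one stage.

    Returns a warning string, or None if GC time is below the threshold.
    """
    total_run = sum(run for run, _ in task_metrics)
    total_gc = sum(gc for _, gc in task_metrics)
    if total_run > 0 and total_gc / total_run > threshold:
        return ("Tasks in this stage spent %.0f%% of their time in GC; "
                "consider increasing executor memory."
                % (100.0 * total_gc / total_run))
    return None
```

Aggregating over the whole stage rather than per task keeps the warning robust to a single outlier task with a long GC pause.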
[jira] [Resolved] (SPARK-2131) Collect per-task filesystem-bytes-read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-2131. --- Resolution: Duplicate Collect per-task filesystem-bytes-read/written metrics -- Key: SPARK-2131 URL: https://issues.apache.org/jira/browse/SPARK-2131 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3682: - Target Version/s: 1.2.0 Affects Version/s: 1.1.0 Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed Spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI that users could go to for indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young generation.
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it.
To start, probably two kinds of warnings would be most helpful:
* Warnings at the app level that report on misconfigurations and issues with the general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.
[jira] [Commented] (SPARK-889) Bring back DFS broadcast
[ https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147032#comment-14147032 ] Josh Rosen commented on SPARK-889: -- In fact, I think [~rxin] has some JIRAs and PRs to make TorrentBroadcast _even_ better than it is now (it was greatly improved from 1.0.2 to 1.1.0), so it's probably safe to close this. Bring back DFS broadcast Key: SPARK-889 URL: https://issues.apache.org/jira/browse/SPARK-889 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Priority: Minor DFS broadcast was a simple way to get better-than-single-master performance for broadcast, so we should add it back for people who have HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147042#comment-14147042 ] Josh Rosen commented on SPARK-3639: --- This sounds reasonable to me; feel free to open a PR. If you look at most of the other Spark examples, they only set the appName when creating the SparkContext and leave the master unspecified in order to allow it to be set when passing the script to {{spark-submit}}. Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor Labels: examples Kinesis examples set master as local thus not allowing the example to be tested on a cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-889) Bring back DFS broadcast
[ https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-889. --- Resolution: Won't Fix Bring back DFS broadcast Key: SPARK-889 URL: https://issues.apache.org/jira/browse/SPARK-889 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Priority: Minor DFS broadcast was a simple way to get better-than-single-master performance for broadcast, so we should add it back for people who have HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2691: --- Assignee: Timothy Chen (was: Tim Chen) Allow Spark on Mesos to be launched with Docker --- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesos Currently, to launch Spark with Mesos one must upload a tarball and specify an executor URI to be passed in, which is downloaded on each slave (or even on each execution, depending on whether coarse-grained mode is used). We want to make Spark able to support launching executors via a Docker image, building on the recent Docker and Mesos integration work. With that integration, Spark can simply specify a Docker image and the options it needs, and it should continue to work as-is.
[jira] [Updated] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode
[ https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3678: - Affects Version/s: (was: 1.2.0) 1.1.0 Yarn app name reported in RM is different between cluster and client mode - Key: SPARK-3678 URL: https://issues.apache.org/jira/browse/SPARK-3678 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves If you launch an application in yarn cluster mode, the name of the application in the ResourceManager generally shows up as the full name org.apache.spark.examples.SparkHdfsLR. If you start the same app in client mode, it shows up as SparkHdfsLR. We should be consistent between them. I haven't looked at it in detail; perhaps it's only the examples, but I think I've seen this with customer apps also.
[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2691: --- Assignee: Tim Chen (was: Timothy Hunter) Allow Spark on Mesos to be launched with Docker --- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Tim Chen Labels: mesos Currently, to launch Spark with Mesos one must upload a tarball and specify the executor URI to be passed in, which is downloaded on each slave or even on each execution, depending on whether coarse-grained mode is used. We want to make Spark able to support launching executors via a Docker image that utilizes the recent Docker and Mesos integration work. With this integration, Spark can simply specify a Docker image and the options that are needed, and it should continue to work as-is.
[jira] [Created] (SPARK-3684) Can't configure local dirs in Yarn mode
Andrew Or created SPARK-3684: Summary: Can't configure local dirs in Yarn mode Key: SPARK-3684 URL: https://issues.apache.org/jira/browse/SPARK-3684 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked up in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either because these are overridden by Yarn. I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the default behavior is for Spark to use Yarn's local dirs, but right now there's no way to change it even if the user wants to.
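The precedence the report describes can be modeled in a few lines (a pure-Python sketch with illustrative names, not Spark's actual resolution code): Yarn injects its own directory variables into the container environment, shadowing whatever the user set, and Spark-on-Yarn reads only those keys.

```python
# Simplified model of the behavior described above: in Yarn mode the
# container environment that Yarn injects always shadows user settings.

def resolve_local_dirs(user_env, yarn_injected_env):
    # Yarn overwrites LOCAL_DIRS / YARN_LOCAL_DIRS in the container env...
    merged = {**user_env, **yarn_injected_env}
    # ...and Spark-on-Yarn reads only those keys, ignoring SPARK_LOCAL_DIRS.
    return merged.get("LOCAL_DIRS") or merged.get("YARN_LOCAL_DIRS")

user = {"SPARK_LOCAL_DIRS": "/fast/ssd", "YARN_LOCAL_DIRS": "/fast/ssd"}
yarn = {"LOCAL_DIRS": "/yarn/nm/usercache", "YARN_LOCAL_DIRS": "/yarn/nm/usercache"}
assert resolve_local_dirs(user, yarn) == "/yarn/nm/usercache"
```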
[jira] [Resolved] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3604. Resolution: Not a Problem unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD -- Key: SPARK-3604 URL: https://issues.apache.org/jira/browse/SPARK-3604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: linux. Used python, but error is in Scala land. Reporter: Eric Friedman Priority: Critical I have a large number of parquet files all with the same schema and attempted to make a UnionRDD out of them. When I call getNumPartitions(), I get a stack overflow error that looks like this: Py4JJavaError: An error occurred while calling o3275.partitions. : java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at 
org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
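The overflow comes from how nested unions recurse into their children; a pure-Python model (no Spark required) reproduces it and shows why the usual workaround — one flat union over all inputs, e.g. PySpark's `sc.union(rdds)` instead of repeated pairwise `a.union(b)` — keeps the recursion depth constant:

```python
# Pure-Python illustration of the failure mode. Union/Leaf are stand-ins
# for UnionRDD and its input RDDs, not Spark classes.

class Union:
    def __init__(self, children):
        self.children = children

    def num_partitions(self):
        # Mirrors UnionRDD.getPartitions: recurses into every child.
        return sum(c.num_partitions() for c in self.children)

class Leaf:
    def num_partitions(self):
        return 1

leaves = [Leaf() for _ in range(50_000)]

# Left-deep pairwise unions: recursion depth grows with the number of inputs.
deep = leaves[0]
for leaf in leaves[1:]:
    deep = Union([deep, leaf])
try:
    deep.num_partitions()
    raise AssertionError("expected RecursionError")
except RecursionError:
    pass

# One flat union over all inputs: recursion depth stays constant.
assert Union(leaves).num_partitions() == 50_000
```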
[jira] [Updated] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python
[ https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3681: --- Component/s: PySpark Failed to serialize ArrayType or MapType after accessing them in Python - Key: SPARK-3681 URL: https://issues.apache.org/jira/browse/SPARK-3681 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu {code} files_schema_rdd.map(lambda x: x.files).take(1) {code} It will also lose the schema after iterating over an ArrayType. {code} files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1) {code}
[jira] [Updated] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR
[ https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3663: --- Component/s: Documentation Document SPARK_LOG_DIR and SPARK_PID_DIR Key: SPARK-3663 URL: https://issues.apache.org/jira/browse/SPARK-3663 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Andrew Ash Assignee: Andrew Ash I'm using these two parameters in some puppet scripts for standalone deployment and realized that they're not documented anywhere. We should document them.
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Component/s: Spark Core History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For example, one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080) to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a Mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote}
[jira] [Resolved] (SPARK-3615) Kafka test should not hard code Zookeeper port
[ https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3615. Resolution: Fixed https://github.com/apache/spark/pull/2483 Kafka test should not hard code Zookeeper port -- Key: SPARK-3615 URL: https://issues.apache.org/jira/browse/SPARK-3615 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Saisai Shao Priority: Blocker This is causing failures in our master build if port 2181 is contended. Instead of binding to a static port, we should refactor this such that it opens a socket on port 0 and then reads back the port, so we can never have contention. {code} sbt.ForkMain$ForkError: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95) at org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.init(KafkaStreamSuite.scala:200) at org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62) at org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:24) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at org.junit.runner.JUnitCore.run(JUnitCore.java:136) at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90) at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223) at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
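The port-0 pattern Patrick describes can be sketched with a plain socket (an illustrative sketch, not the actual test refactor): bind to port 0 so the OS picks a free ephemeral port, then read the chosen port back, which eliminates contention on a fixed port like 2181.

```python
import socket

def bind_ephemeral():
    """Bind to port 0 ("any free port") and return the socket plus the
    port the OS actually assigned, read back via getsockname()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))    # port 0 asks the OS for a free port
    port = sock.getsockname()[1]   # read back what was assigned
    return sock, port

# Two servers started this way can never contend for the same port.
s1, p1 = bind_ephemeral()
s2, p2 = bind_ephemeral()
assert p1 != 0 and p2 != 0 and p1 != p2
s1.close(); s2.close()
```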
[jira] [Created] (SPARK-3685) Spark's local dir scheme is not configurable
Andrew Or created SPARK-3685: Summary: Spark's local dir scheme is not configurable Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected.
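The difference between java.io.File-style path parsing and scheme-aware parsing can be seen with Python's standard library (an analogy to the Java behavior, not Spark's code): a plain path parser treats `hdfs:` as the first directory component, whereas a URI-aware parser recognizes it as a scheme.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

raw = "hdfs:/tmp/foo"

# Plain path parsing: "hdfs:" becomes a relative directory named "hdfs:",
# which is exactly the "folder called hdfs:" behavior described above.
assert PurePosixPath(raw).parts == ("hdfs:", "tmp", "foo")

# URI-aware parsing recovers the intended scheme and absolute path,
# analogous to what Hadoop's Path/FileSystem classes would do.
uri = urlparse(raw)
assert (uri.scheme, uri.path) == ("hdfs", "/tmp/foo")
```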
[jira] [Commented] (SPARK-3476) Yarn ClientBase.validateArgs memory checks wrong
[ https://issues.apache.org/jira/browse/SPARK-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147180#comment-14147180 ] Apache Spark commented on SPARK-3476: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2528 Yarn ClientBase.validateArgs memory checks wrong Key: SPARK-3476 URL: https://issues.apache.org/jira/browse/SPARK-3476 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves The yarn ClientBase.validateArgs memory checks are no longer valid. It used to be that the overhead was taken out of what the user specified; now we add it on top of what the user specifies. We can probably just remove these checks: (args.amMemory <= memoryOverhead) -> ("Error: AM memory size must be " + "greater than: " + memoryOverhead), (args.executorMemory <= memoryOverhead) -> ("Error: Executor memory size " + "must be greater than: " + memoryOverhead.toString)
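To make the stale check concrete, here is a toy model of the two semantics (pure Python; the 384 MB overhead and 256 MB request are illustrative numbers, not values from the issue):

```python
OVERHEAD_MB = 384  # illustrative overhead value

def old_usable(requested_mb):
    # Old scheme: overhead was carved out of the user's request, so a
    # request smaller than the overhead left nothing usable -- hence the check.
    return requested_mb - OVERHEAD_MB

def new_total(requested_mb):
    # New scheme: overhead is added on top, so any positive request works
    # and the "must be greater than overhead" check is no longer needed.
    return requested_mb + OVERHEAD_MB

assert old_usable(256) <= 0    # old scheme: the validation was genuinely needed
assert new_total(256) == 640   # new scheme: small requests still get a usable allocation
```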
[jira] [Created] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
Patrick Wendell created SPARK-3686: -- Summary: flume.SparkSinkSuite.Success is flaky Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 Stacktrace sbt.ForkMain$ForkError: 4000 did not equal 5000 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416) at org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at 
org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.scalatest.FunSuite.run(FunSuite.scala:1559) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Example test result (this will stop working in a few days): 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/
[jira] [Commented] (SPARK-3685) Spark's local dir scheme is not configurable
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147206#comment-14147206 ] Andrew Or commented on SPARK-3685: -- Note that this is not meaningful unless we also change the usages of this to use the Hadoop FileSystem. This requires a non-trivial refactor of shuffle and spill code to use the Hadoop API. Spark's local dir scheme is not configurable Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected.
[jira] [Commented] (SPARK-3412) Add Missing Types for Row API
[ https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147261#comment-14147261 ] Apache Spark commented on SPARK-3412: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2529 Add Missing Types for Row API - Key: SPARK-3412 URL: https://issues.apache.org/jira/browse/SPARK-3412 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor
[jira] [Commented] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147274#comment-14147274 ] Kousuke Saruta commented on SPARK-3610: --- Hi [~skrishna...@gmail.com], I'm trying to resolve a similar issue and I think I can resolve this issue using the Application ID. See https://github.com/apache/spark/pull/2432 History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For example, one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080) to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a Mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote}
[jira] [Comment Edited] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147274#comment-14147274 ] Kousuke Saruta edited comment on SPARK-3610 at 9/25/14 2:35 AM: Hi [~SK], I'm trying to resolve a similar issue and I think I can resolve this issue using the Application ID. See https://github.com/apache/spark/pull/2432 was (Author: sarutak): Hi [~skrishna...@gmail.com], I'm trying to resolve a similar issue and I think I can resolve this issue using the Application ID. See https://github.com/apache/spark/pull/2432 History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For example, one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080) to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a Mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters?
{quote}
[jira] [Updated] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3665: -- Description: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameter lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD # JavaGraphLoader #- removes optional params, or uses builder pattern was: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: 1. JavaGraph -- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply -- removes implicit {{=:=}} param from mapVertices, outerJoinVertices -- merges multiple parameter lists -- incorporates GraphOps 2. JavaVertexRDD 3. JavaEdgeRDD 4. JavaGraphLoader -- removes optional params, or uses builder pattern Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameter lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD # JavaGraphLoader #- removes optional params, or uses builder pattern
[jira] [Commented] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD
[ https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147280#comment-14147280 ] Apache Spark commented on SPARK-3666: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2530 Extract interfaces for EdgeRDD and VertexRDD Key: SPARK-3666 URL: https://issues.apache.org/jira/browse/SPARK-3666 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave
[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147314#comment-14147314 ] Hari Shreedharan commented on SPARK-3686: - Looking into this. flume.SparkSinkSuite.Success is flaky - Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 Stacktrace sbt.ForkMain$ForkError: 4000 did not equal 5000 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416) at org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.scalatest.FunSuite.run(FunSuite.scala:1559) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Example test result 
(this will stop working in a few days): https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147316#comment-14147316 ] Hari Shreedharan commented on SPARK-3686: - Unlike the other tests in this suite, this one does not have a sleep to let the sink commit the transactions back to the channel, so the channel does not get enough time to actually become empty. Let me add a sleep - I will send a PR and run the pre-commit hook a bunch of times to ensure that it fixes it. flume.SparkSinkSuite.Success is flaky - Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 {code} Example test result (this will stop working in a few days):
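The sleep-based fix discussed above trades flakiness for a fixed delay. A less timing-sensitive variant polls the channel size against a deadline and returns as soon as it drains. A hedged Java sketch with a simulated channel (an `AtomicInteger` standing in for the real Flume channel; the actual Flume API is not used here):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

public class AwaitEmpty {
    // Poll until the supplier reports zero, or the deadline passes.
    // Returns early as soon as the channel drains, unlike a fixed sleep.
    static boolean awaitEmpty(IntSupplier channelSize, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (channelSize.getAsInt() == 0) return true;
            Thread.sleep(10); // brief back-off instead of one large fixed sleep
        }
        return channelSize.getAsInt() == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a channel that drains asynchronously as the sink commits transactions.
        AtomicInteger remaining = new AtomicInteger(5000);
        Thread committer = new Thread(() -> {
            while (remaining.get() > 0) {
                remaining.addAndGet(-1000);
                try { Thread.sleep(20); } catch (InterruptedException e) { return; }
            }
        });
        committer.start();
        boolean empty = awaitEmpty(remaining::get, 2000);
        committer.join();
        System.out.println(empty ? "channel empty" : "timed out");
    }
}
```

The assertion `assertChannelIsEmpty` that failed with "4000 did not equal 5000" would then only run after this wait succeeds or times out.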
[jira] [Resolved] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-546. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Aaron Staple Fixed by: https://github.com/apache/spark/pull/1395 Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
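For reference, the full-outer-join semantics this issue asks for on RDD[(K,V)] (every key from either side appears, with an absent side represented as empty) can be sketched in plain Java over in-memory maps. This illustrates the contract only, not the Spark implementation merged in PR 1395:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class FullOuterJoin {
    // Full outer join of two key->value maps: the result covers the union of keys,
    // pairing Optional.empty() with any key missing on one side.
    static <K, V, W> Map<K, Map.Entry<Optional<V>, Optional<W>>> fullOuterJoin(
            Map<K, V> left, Map<K, W> right) {
        Map<K, Map.Entry<Optional<V>, Optional<W>>> out = new HashMap<>();
        Set<K> keys = new HashSet<>(left.keySet());
        keys.addAll(right.keySet());
        for (K k : keys) {
            out.put(k, new AbstractMap.SimpleEntry<>(
                    Optional.ofNullable(left.get(k)),
                    Optional.ofNullable(right.get(k))));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("x", 1, "y", 2);
        Map<String, String> b = Map.of("y", "b", "z", "c");
        fullOuterJoin(a, b).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A left outer join would drop the "z" row and a right outer join the "x" row; the full outer join keeps both with an empty side.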
[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147331#comment-14147331 ] Apache Spark commented on SPARK-3686: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/2531 flume.SparkSinkSuite.Success is flaky - Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 {code} Example test result (this will stop working in a few days): https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SPARK-3687) Spark hang while
Ziv Huang created SPARK-3687: Summary: Spark hang while Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Reporter: Ziv Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Summary: Spark hang while processing more than 100 sequence files (was: Spark hang while ) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Reporter: Ziv Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Affects Version/s: 1.0.2 1.1.0 Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Component/s: Spark Core Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: I use spark Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang I use spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered (was: In my application, I read more than 100 sequence files, ) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get job hanged if the number of partitions to be processed is no greater than 80. was:In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get job hanged if the number of partitions to be processed is no greater than 80. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a job hanged if the number of partitions to be processed is no greater than 80. was: In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get job hanged if the number of partitions to be processed is no greater than 80. Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files to a JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a job hanged if the number of partitions to be processed is no greater than 80. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
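The coalesce workaround the reporter describes works by packing the parent partitions into fewer groups, so fewer tasks run without moving data through a shuffle. A rough plain-Java sketch of that grouping, illustrative only (Spark's actual coalesce uses a locality-aware packing; the round-robin assignment here is a simplifying assumption):

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Pack `parts` parent partitions into at most `target` groups of near-equal
    // size, the way coalesce(target) merges partitions without a shuffle.
    static List<List<Integer>> coalesce(int parts, int target) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < Math.min(parts, target); g++) groups.add(new ArrayList<>());
        for (int p = 0; p < parts; p++) {
            groups.get(p % groups.size()).add(p); // round-robin assignment (assumed)
        }
        return groups;
    }

    public static void main(String[] args) {
        // 120 input partitions (the range where the hang appears) packed into 80 tasks.
        List<List<Integer>> groups = coalesce(120, 80);
        System.out.println("groups=" + groups.size());
        int total = groups.stream().mapToInt(List::size).sum();
        System.out.println("partitions=" + total);
    }
}
```

Every parent partition still gets processed; only the task count drops, which matches the observation that jobs with at most 80 partitions never hang.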
[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147378#comment-14147378 ] Nan Zhu commented on SPARK-2647: isn't it the expected behaviour, as we keep DAGScheduler in single-thread mode? DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few jobs are submitted, DAGScheduler blocks the others while processing one JobSubmitted event. For example, one JobSubmitted event is processed as follows and costs much time: spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
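The trace above shows getPartitions resolving HDFS block locations inside the event-processing thread, and Nan Zhu's comment notes that DAGScheduler is deliberately single-threaded. The resulting head-of-line blocking can be reproduced with any single-threaded event loop; a hedged Java sketch (an illustration of the pattern, not the DAGScheduler code itself):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadEventLoop {
    public static void main(String[] args) throws Exception {
        // One event-processing thread, as in DAGScheduler's event loop.
        ExecutorService loop = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();

        // A "JobSubmitted" event that is slow, e.g. getPartitions calling the NameNode.
        loop.submit(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { /* ignore */ }
        });
        // A second job submitted immediately after; it cannot start until the first finishes.
        Future<Long> second = loop.submit(() -> System.nanoTime() - start);

        long waitedMs = second.get() / 1_000_000;
        System.out.println("second event waited ~" + waitedMs + " ms");
        loop.shutdown();
    }
}
```

The second event always waits at least as long as the first event's processing time, which is the behaviour the issue reports: one expensive JobSubmitted delays every queued event behind it.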