[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064625#comment-14064625 ] Andrew Ash commented on SPARK-2546: ---

On the thread:

Me:
{quote}
Reynold's recent announcement of the broadcast RDD object patch may also have implications for the right path forward here. I'm not sure I fully understand the implications though: https://github.com/apache/spark/pull/1452 Once this is committed, we can also remove the JobConf broadcast in HadoopRDD.
{quote}

[~pwendell]:
{quote}
I think you are correct and a follow-up to SPARK-2521 will end up fixing this. The design of SPARK-2521 automatically broadcasts RDD data in tasks, and the approach creates a new copy of the RDD and associated data for each task. A natural follow-up to that patch is to stop handling the jobConf separately (since we will now broadcast all referents of the RDD itself) and just have it broadcasted with the RDD. I'm not sure if Reynold plans to include this in SPARK-2521 or afterwards, but it's likely we'd do that soon.
{quote}

Configuration object thread safety issue

Key: SPARK-2546
URL: https://issues.apache.org/jira/browse/SPARK-2546
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.1
Reporter: Andrew Ash

// observed in 0.9.1 but expected to exist in 1.0.1 as well

This ticket is copy-pasted from a thread on the dev@ list:

{quote}
We discovered a very interesting bug in Spark at work last week in Spark 0.9.1: the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain:

Observations
- Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files)
- 412 of 413 tasks completed, but the last task was hung in RUNNING state
- The 412 successful tasks completed in median time 3.4s
- The last hung task didn't finish even in 20 hours
- The executor with the hung task was responsible for 100% of one core of CPU usage
- Jstack of the executor attached (relevant thread pasted below)

Diagnosis
After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark.

The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe. The post below is an excellent explanation of what happens when multiple threads insert into a HashMap at the same time: http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html

The gist is that you end up with a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace.

So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object.
Proposed fix
We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus its own Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the receiving side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB, so I would expect a clone of the object to be relatively speedy.

Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix this.

Thanks!
Andrew

P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines.

{noformat}
Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.transfer(HashMap.java:601)
        at java.util.HashMap.resize(HashMap.java:581)
        at java.util.HashMap.addEntry(HashMap.java:879)
        at java.util.HashMap.put(HashMap.java:505)
        at
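For illustration, a minimal Scala sketch of the clone-per-task idea proposed above. This is a hedged sketch only, not the actual HadoopRDD.getJobConf() implementation; the helper name and its call site are assumptions.
{code}
import org.apache.hadoop.mapred.JobConf

// Hypothetical helper: instead of handing every task the shared (broadcast)
// JobConf, give each task its own copy so a record reader that mutates its
// configuration cannot corrupt the underlying HashMap under concurrency.
def jobConfForTask(shared: JobConf): JobConf = {
  // The JobConf(Configuration) constructor copies the entries (~10KB),
  // so the per-task cost should be small relative to task setup.
  new JobConf(shared)
}
{code}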
[jira] [Commented] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064626#comment-14064626 ] Andrew Ash commented on SPARK-2521: --- Reynold's PR: https://github.com/apache/spark/pull/1452 Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2521: -- Comment: was deleted (was: Reynold's PR: https://github.com/apache/spark/pull/1452) Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064636#comment-14064636 ] Saisai Shao commented on SPARK-2492: Hi TD, I revisited Kafka's ConsoleConsumer carefully, and I have started to doubt the purpose of this tricky modification. When auto.offset.reset = smallest, the consumer offset seeks to the beginning of the partition *only* when the *current offset is out of range*. Deleting the Zookeeper metadata will *force* reading data from the beginning of the partition *immediately*. So we only need to delete the Zookeeper metadata explicitly when we want to fetch data from the beginning right away, as ConsoleConsumer does when the --from-beginning parameter is specified. I also revisited the previous PR about this part ([https://github.com/mesos/spark/pull/527]) carefully; it seems I redid the same thing as this [commit|https://github.com/Reinvigorate/spark/commit/cfa8e769a86664722f47182fa572179e8beadcb7]. I'm not sure what the original purpose of deleting the Zookeeper metadata was, regardless of whether auto.offset.reset is smallest or largest. After rethinking this part, I think this tricky hack is needless; only when we need to read data from the beginning immediately do we need to delete this metadata. So I'm not sure if you know the original purpose. Sorry for the immature thought and PR; if you think there is no need to modify it, I will close the PR. KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.1.0 Update to delete Zookeeper metadata when Kafka's auto.offset.reset parameter is set to smallest, which is aligned with Kafka 0.8's ConsoleConsumer. Also use Kafka's offered API instead of directly using zkClient. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2548) JavaRecoverableWordCount is missing
Xiangrui Meng created SPARK-2548: Summary: JavaRecoverableWordCount is missing Key: SPARK-2548 URL: https://issues.apache.org/jira/browse/SPARK-2548 Project: Spark Issue Type: Bug Components: Documentation, Streaming Affects Versions: 0.9.2, 1.0.1 Reporter: Xiangrui Meng JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We need to rewrite the example because the code was lost during the migration from spark/spark-incubating to apache/spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2549) Functions defined inside of other functions trigger failures
Patrick Wendell created SPARK-2549: -- Summary: Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell Assignee: Prashant Sharma If we have a function reference inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2549: --- Description: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} was:If we have a function reference inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2549: --- Description: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} was: If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) 
used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2412) CoalescedRDD throws exception with certain pref locs
[ https://issues.apache.org/jira/browse/SPARK-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2412. Resolution: Fixed Fix Version/s: 1.1.0 1.0.2 Issue resolved by pull request 1337 [https://github.com/apache/spark/pull/1337] CoalescedRDD throws exception with certain pref locs Key: SPARK-2412 URL: https://issues.apache.org/jira/browse/SPARK-2412 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.0.2, 1.1.0 If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as groupHash.get(nxt_replica).get is not valid. The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
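For illustration, a hedged Scala sketch of the "add an ArrayBuffer if it didn't already exist" fix described above; the names groupHash and nxt_replica come from the description, while the element type and surrounding values are assumptions.
{code}
import scala.collection.mutable

// Assumed shape: preferred location -> partitions grouped at that location.
val groupHash = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
val nxt_replica = "host-42"  // hypothetical location found only in the second pass

// groupHash.get(nxt_replica).get throws when the location was never seen in
// the first pass; creating the bucket lazily avoids the exception.
val bucket = groupHash.getOrElseUpdate(nxt_replica, mutable.ArrayBuffer.empty[Int])
bucket += 7  // hypothetical partition index
{code}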
[jira] [Resolved] (SPARK-2526) Simplify make-distribution.sh to just pass through Maven options
[ https://issues.apache.org/jira/browse/SPARK-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2526. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1445 [https://github.com/apache/spark/pull/1445] Simplify make-distribution.sh to just pass through Maven options Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.1.0 There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and also limits the number of ways that packagers can use this. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
Xiangrui Meng created SPARK-2550: Summary: Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2551: -- Description: To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. was: To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
Cheng Lian created SPARK-2551: - Summary: Cleanup FilteringParquetRowInputFormat Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2552) Stabilize the computation of logistic function in pyspark
Xiangrui Meng created SPARK-2552: Summary: Stabilize the computation of logistic function in pyspark Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2552: - Description: exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. (was: exp(1000) throws an error in python. For logistic function, we can use either 1 / ( 1 + exp(-x) ) or 1 - 1 / (1 + exp(x) ) to compute its value which ensuring exp always takes a negative value.) Stabilize the computation of logistic function in pyspark - Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Labels: Starter exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
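The fix itself belongs in PySpark's Python code; as a language-neutral illustration of the trick, here is a hedged Scala sketch that always passes a non-positive argument to exp so it can never overflow.
{code}
// Numerically stable logistic function: pick whichever of the two
// algebraically equal forms keeps the argument of exp() non-positive.
def logistic(x: Double): Double =
  if (x >= 0) {
    1.0 / (1.0 + math.exp(-x))      // exp(-x) <= 1, no overflow
  } else {
    val ex = math.exp(x)            // exp(x) < 1 when x < 0
    ex / (1.0 + ex)                 // equals 1 - 1 / (1 + exp(x))
  }
{code}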
[jira] [Updated] (SPARK-2423) Clean up SparkSubmit for readability
[ https://issues.apache.org/jira/browse/SPARK-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2423: --- Assignee: Andrew Or Clean up SparkSubmit for readability Key: SPARK-2423 URL: https://issues.apache.org/jira/browse/SPARK-2423 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 It is currently not trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. We should clean up the logic a little if we want to extend it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2552: - Labels: Starter (was: ) Stabilize the computation of logistic function in pyspark - Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Labels: Starter exp(1000) throws an error in Python. For the logistic function, we can use either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, choosing the form that ensures the argument of exp is always negative. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2119) Reading Parquet InputSplits dominates query execution time when reading off S3
[ https://issues.apache.org/jira/browse/SPARK-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064701#comment-14064701 ] Cheng Lian commented on SPARK-2119: --- Agree. Created SPARK-2551 for removing those hacks. Reading Parquet InputSplits dominates query execution time when reading off S3 -- Key: SPARK-2119 URL: https://issues.apache.org/jira/browse/SPARK-2119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.1.0 Here's the relevant stack trace where things are hanging: {code} at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) {code} We should parallelize or cache or something here. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar
[ https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2476: --- Priority: Minor (was: Major) Have sbt-assembly include runtime dependencies in jar - Key: SPARK-2476 URL: https://issues.apache.org/jira/browse/SPARK-2476 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Minor If possible, we should try to contribute the ability to include runtime-scoped dependencies in the assembly jar created with sbt-assembly. Currently in only reads compile-scoped dependencies: https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
Sandy Ryza created SPARK-2553: - Summary: CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064709#comment-14064709 ] Sandy Ryza commented on SPARK-2553: --- https://github.com/apache/spark/pull/1461 CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key - Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2554) CountDistinct and SumDistinct should do partial aggregation
Cheng Lian created SPARK-2554: - Summary: CountDistinct and SumDistinct should do partial aggregation Key: SPARK-2554 URL: https://issues.apache.org/jira/browse/SPARK-2554 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian {{CountDistinct}} and {{SumDistinct}} should first do a partial aggregation and return unique value sets in each partition as partial results. Shuffle IO can be greatly reduced in cases where there are only a few unique values. -- This message was sent by Atlassian JIRA (v6.2#6252)
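A hedged sketch of the idea at the RDD level (not the actual Catalyst aggregate operators): compute the set of unique values per partition first, so the shuffle carries small sets instead of all rows.
{code}
import org.apache.spark.rdd.RDD

// Partial aggregation for COUNT(DISTINCT x): each partition emits only its
// unique values; merging those sets is cheap when there are few unique values.
def countDistinct(values: RDD[Long]): Long =
  values
    .mapPartitions(iter => Iterator(iter.toSet))  // partial result per partition
    .reduce(_ ++ _)                               // merge the small sets
    .size
    .toLong
{code}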
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2551: -- Issue Type: Improvement (was: Bug) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064721#comment-14064721 ] Saisai Shao commented on SPARK-2492: Hi TD, I also did some experiments on the previous code. In the previous code, the Zookeeper group metadata is cleaned whenever auto.offset.reset is set, no matter whether it is smallest or largest, which leads to two results: 1. smallest: we always read data from the beginning of the partition, no matter whether the groupid is new or old. 2. largest: we always read data from the end of the partition, no matter whether the groupid is new or old. I think the reason is that because we delete the group metadata in Zookeeper, Kafka can only rely on auto.offset.reset to position the offset. If we do not remove the Zookeeper metadata, the result becomes: 1. smallest: we read from the beginning of the partition for a new groupid, and for an old groupid the start point is the last committed offset. 2. largest: we read from the end of the partition for a new groupid, and for an old groupid the start point is the last committed offset. So in the previous code, auto.offset.reset is not a hint for out-of-range seeking; it is an immediate enforcement that the offset seek to the beginning or end of the partition, and I'm not sure what the purpose of the previous design was. I think directly seeking to the beginning or end of the partition when auto.offset.reset is set diverges from Kafka's own behavior and will lead to unwanted results when people set this parameter (because it differs from Kafka's predefined meaning). So I'd prefer to remove this code path. What are your thoughts and concerns? KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.1.0 Update to delete Zookeeper metadata when Kafka's auto.offset.reset parameter is set to smallest, which is aligned with Kafka 0.8's ConsoleConsumer. Also use Kafka's offered API instead of directly using zkClient. -- This message was sent by Atlassian JIRA (v6.2#6252)
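For reference, a hedged sketch of the consumer configuration under discussion, using plain Kafka 0.8 property names; the host name and group id are hypothetical. Kafka's own semantics for auto.offset.reset are what the comment describes: it is consulted only when the committed offset is out of range, not a request to seek immediately.
{code}
import java.util.Properties

val props = new Properties()
props.put("zookeeper.connect", "zk-host:2181")  // hypothetical quorum
props.put("group.id", "spark-streaming-group")  // hypothetical group id
// "smallest" -> fall back to the beginning of the partition,
// "largest"  -> fall back to the end, but only on an out-of-range offset.
props.put("auto.offset.reset", "smallest")
{code}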
[jira] [Created] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
Zhihui created SPARK-2555: - Summary: Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. was:In SPARK-1946, Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064743#comment-14064743 ] Zhihui commented on SPARK-2555: --- I submitted a PR: https://github.com/apache/spark/pull/1462 Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This jira ticket tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2555) Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode.
[ https://issues.apache.org/jira/browse/SPARK-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2555: -- Description: In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This tries to introduce the configuration to Mesos mode. was: In SPARK-1946, configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only support Standalone and Yarn mode. This jira ticket try to introduce the configuration to Mesos mode. Support configuration spark.scheduler.minRegisteredExecutorsRatio in Mesos mode. Key: SPARK-2555 URL: https://issues.apache.org/jira/browse/SPARK-2555 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Zhihui In SPARK-1946, the configuration spark.scheduler.minRegisteredExecutorsRatio was introduced, but it only supports Standalone and Yarn modes. This tries to introduce the configuration to Mesos mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2556) Multiple SparkContexts can coexist in one process
YanTang Zhai created SPARK-2556: --- Summary: Multiple SparkContexts can coexist in one process Key: SPARK-2556 URL: https://issues.apache.org/jira/browse/SPARK-2556 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor Multiple SparkContexts cannot currently coexist in one process, since different SparkContexts share the same global variables. These global variables and objects will be made local so that multiple SparkContexts can coexist. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2491: --- Summary: When an OOM is thrown,the executor does not stop properly. (was: When an OOM is thrown,the executor is not properly stopped.) When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) at
[jira] [Commented] (SPARK-2156) When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks
[ https://issues.apache.org/jira/browse/SPARK-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064965#comment-14064965 ] DjvuLee commented on SPARK-2156: I see this is fixed in the Spark branch-0.9 on GitHub, but has it been updated in the Spark v0.9.1 release on the http://spark.apache.org/ site? When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks -- Key: SPARK-2156 URL: https://issues.apache.org/jira/browse/SPARK-2156 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: AWS EC2 1 master 2 slaves with the instance type of r3.2xlarge Reporter: Chen Jin Assignee: Xiangrui Meng Priority: Blocker Fix For: 0.9.2, 1.0.1, 1.1.0 Original Estimate: 504h Remaining Estimate: 504h I have done some experiments when the frameSize is around 10MB. 1) spark.akka.frameSize = 10: If one of the partition sizes is very close to 10MB, say 9.97MB, the execution blocks without any exception or warning. The worker finishes the task and sends the serialized result, and then throws an exception saying the Hadoop IPC client connection stops (seen after changing the logging to debug level). However, the master never receives the results and the program just hangs. But if the sizes of all partitions are less than some number between 9.96MB and 9.97MB, the program works fine. 2) spark.akka.frameSize = 9: when the partition size is just a little bit smaller than 9MB, it fails as well. This bug behavior is not exactly what SPARK-1112 is about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064996#comment-14064996 ] Ken Carlile commented on SPARK-2282: So we've just given this a try with a 32 node cluster. Without the two sysctl commands, it obviously failed, using this code in pyspark: {code} data = sc.parallelize(range(0,3000), 2000).map(lambda x: range(0,300)) data.cache() data.count() for i in range(0,20): data.count() {code} Unfortunately, with the two sysctls implemented on all nodes in the cluster, it also failed. Here's the java errors we see: {code:java} 14/07/17 10:55:37 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:208) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:404) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:72) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:280) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:278) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:278) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:820) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1226) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Traceback (most recent call last): File stdin, line 1, in module File /usr/local/spark-current/python/pyspark/rdd.py, line 708, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /usr/local/spark-current/python/pyspark/rdd.py, line 699, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File 
/usr/local/spark-current/python/pyspark/rdd.py, line 619, in reduce vals = self.mapPartitions(func).collect() File /usr/local/spark-current/python/pyspark/rdd.py, line 583, in collect bytesInJava = self._jrdd.collect().iterator() File /usr/local/spark-current/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /usr/local/spark-current/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o158.collect. : org.apache.spark.SparkException: Job 14 cancelled as part of cancellation of all jobs at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1009) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at
[jira] [Created] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
Ye Xianjin created SPARK-2557: - Summary: createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
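A hedged Scala sketch of the corrected pattern and how createTaskScheduler might match it (simplified; the real method handles many more master URL forms):
{code}
// Digits or a literal '*' as alternatives, rather than mixing '*' into the
// digit character class as the current LOCAL_N_REGEX does.
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
// The local-n-failures form could reuse the same alternation.
val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r

def threads(master: String): Option[Int] = master match {
  case LOCAL_N_REGEX(n) =>
    Some(if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt)
  case LOCAL_N_FAILURES_REGEX(n, _) =>
    Some(if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt)
  case _ => None
}
{code}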
[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
[ https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065001#comment-14065001 ] Ye Xianjin commented on SPARK-2557: --- I will send a pr for this. createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065006#comment-14065006 ] Matthew Farrellee commented on SPARK-2494: -- [~davies] will you provide an example that demonstrates the issue? Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, and also of tuples containing None, differs across machines, so the result will be wrong if None appears in the key of partitionBy(). It should use a portable hash function as the default partition function, one that generates the same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065025#comment-14065025 ] Ken Carlile commented on SPARK-2282: A little more info: Nodes are running Scientific Linux 6.3 (Linux 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 07:08:44 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux). Spark is run against Python 2.7.6, Java 1.7.0.25, and Scala 2.10.3.

spark-env.sh:
{code}
#!/usr/bin/env bash
ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3
export SPARK_WORKER_DIR=/scratch/spark/work
export JAVA_HOME=/usr/local/jdk1.7.0_25
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/
export SPARK_EXECUTOR_MEMORY=100g
export SPARK_DRIVER_MEMORY=100g
export SPARK_WORKER_MEMORY=100g
export SPARK_LOCAL_DIRS=/scratch/spark/tmp
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python
export SPARK_SLAVES=/scratch/spark/tmp/slaves
{code}

spark-defaults.conf:
{code}
spark.akka.timeout=300
spark.storage.blockManagerHeartBeatMs=3
spark.akka.retry.wait=30
spark.akka.frameSize=1
{code}

PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle; or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
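The in-Spark alternative mentioned in the issue body is to set SO_REUSEADDR where the accumulator socket is created; a hedged sketch with the plain java.net API follows (the real code lives in PythonAccumulatorParam, and the host/port here are placeholders):
{code}
import java.net.{InetSocketAddress, Socket}

def connectToAccumulatorServer(host: String, port: Int): Socket = {
  val socket = new Socket()
  // Allow the local address to be reused while old connections sit in
  // TIME_WAIT, so rapid task completion is less likely to exhaust ports.
  socket.setReuseAddress(true)
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}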
[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
[ https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065029#comment-14065029 ] Ye Xianjin commented on SPARK-2557: --- Github pr: https://github.com/apache/spark/pull/1464 createTaskScheduler should be consistent between local and local-n-failures Key: SPARK-2557 URL: https://issues.apache.org/jira/browse/SPARK-2557 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to estimate the number of cores on the machine. I think we should also be able to use * in the local-n-failures mode. And according to the LOCAL_N_REGEX pattern matching code, I believe the regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be {code} local\[([0-9]+|\*)\].r {code} rather than {code} local\[([0-9\*]+)\].r {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2523: Target Version/s: 1.1.0 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065032#comment-14065032 ] Matthew Farrellee commented on SPARK-2256: -- [~angel2014] i've tried this using a local file and line lengths from 1 to 64K (by powers of 2) and have not been able to reproduce this. how frequently does this fail? are you still seeing this issue on the tip of master? pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065035#comment-14065035 ] Yin Huai commented on SPARK-2523: - I see. Although we are using the right SerDe to deserialize a row, we are using the wrong ObjectInspector to extract fields (in attributeFunctions)... Also, creating Rows in TableReader makes sense. Will finish my review soon. Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2021) External hashing in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065040#comment-14065040 ] Matthew Farrellee commented on SPARK-2021: -- [~matei][~prashant_] what do you mean by external hashing? External hashing in PySpark --- Key: SPARK-2021 URL: https://issues.apache.org/jira/browse/SPARK-2021 Project: Spark Issue Type: Bug Components: PySpark Reporter: Matei Zaharia Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts
[ https://issues.apache.org/jira/browse/SPARK-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065047#comment-14065047 ] Matthew Farrellee commented on SPARK-1670: -- SPARK-2313 is the root cause of this. a workaround for this would be complex because the extra text on stdout is coming from the same jvm that should produce the py4j port. PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts --- Key: SPARK-1670 URL: https://issues.apache.org/jira/browse/SPARK-1670 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark Python 2.7.5 (default, Aug 25 2013, 00:04:04) ... IPython 1.1.0 ... Spark version 1.0.0-SNAPSHOT Using Python version 2.7.5 (default, Aug 25 2013 00:04:04) Reporter: Pat McDonough When JVM debugging options are in conf/java-opts, it causes pyspark to fail when creating the SparkContext. The java-opts file looks like the following: {code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 {code} Here's the error: {code}--- ValueErrorTraceback (most recent call last) /Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where) 202 else: 203 filename = fname -- 204 __builtin__.execfile(filename, *where) /Users/pat/Projects/spark/python/pyspark/shell.py in module() 41 SparkContext.setSystemProperty(spark.executor.uri, os.environ[SPARK_EXECUTOR_URI]) 42 --- 43 sc = SparkContext(os.environ.get(MASTER, local[*]), PySparkShell, pyFiles=add_files) 44 45 print(Welcome to /Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway) 92 tempNamedTuple = namedtuple(Callsite, function file linenum) 93 self._callsite = tempNamedTuple(function=None, file=None, linenum=None) --- 94 SparkContext._ensure_initialized(self, gateway=gateway) 95 96 self.environment = environment or {} /Users/pat/Projects/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway) 172 with SparkContext._lock: 173 if not SparkContext._gateway: -- 174 SparkContext._gateway = gateway or launch_gateway() 175 SparkContext._jvm = SparkContext._gateway.jvm 176 SparkContext._writeToFile = SparkContext._jvm.PythonRDD.writeToFile /Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway() 44 proc = Popen(command, stdout=PIPE, stdin=PIPE) 45 # Determine which ephemeral port the server started on: --- 46 port = int(proc.stdout.readline()) 47 # Create a thread to echo output from the GatewayServer, which is required 48 # for Java log output to show up: ValueError: invalid literal for int() with base 10: 'Listening for transport dt_socket at address: 5005\n' {code} Note that when you use JVM debugging, the very first line of output (e.g. when running spark-shell) looks like this: {code}Listening for transport dt_socket at address: 5005{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
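To make the failure mode concrete: launch_gateway() treats the very first stdout line from the launched JVM as the py4j port, so the JDWP banner breaks the int() parse. Below is a minimal sketch of one conceivable mitigation (skip non-numeric lines until a port appears); it is illustrative only, not the project's fix, and as the comment above notes, other JVM output could still interleave, which is why a real workaround is harder than this.
{code}
# Hypothetical sketch: skip non-numeric stdout lines (e.g. the JDWP
# "Listening for transport dt_socket ..." banner) until a port appears.
def read_gateway_port(stdout, max_lines=10):
    for _ in range(max_lines):
        line = stdout.readline().strip()
        try:
            return int(line)          # the py4j GatewayServer port
        except ValueError:
            continue                  # ignore debugger/JVM banners
    raise RuntimeError("could not find py4j port in JVM output")
{code}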
[jira] [Updated] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ángel Álvarez updated SPARK-2256: - Attachment: A_test.zip I've tried with different files and sizes ... but I can't figure out the reason why it doesn't work ... If I try with the files downloaded from https://github.com/richardbishop/PerformanceTestData ... everything works OK. pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-1662) PySpark fails if python class is used as a data container
[ https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065059#comment-14065059 ] Matthew Farrellee commented on SPARK-1662: -- [~nrchandan] and [~pwendell] - i recommend you close this as not a bug. it's not pyspark's fault that the user-defined class is not able to be pickled. you can change the Point class in the example to make it pickleable and the example program will work. see https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled original gist for posterity -
{code}
import pyspark

class Point(object):
    '''this class being used as container'''
    pass

def to_point_obj(point_as_dict):
    '''convert a dict representation of a point to Point object'''
    p = Point()
    p.x = point_as_dict['x']
    p.y = point_as_dict['y']
    return p

def add_two_points(point_obj1, point_obj2):
    print type(point_obj1), type(point_obj2)
    point_obj1.x += point_obj2.x
    point_obj1.y += point_obj2.y
    return point_obj1

def zero_point():
    p = Point()
    p.x = p.y = 0
    return p

sc = pyspark.SparkContext('local', 'test_app')
a = sc.parallelize([{'x':1, 'y':1}, {'x':2, 'y':2}, {'x':3, 'y':3}])
b = a.map(to_point_obj)  # convert to an RDD of Point objects
c = b.fold(zero_point(), add_two_points)
{code}
PySpark fails if python class is used as a data container - Key: SPARK-1662 URL: https://issues.apache.org/jira/browse/SPARK-1662 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: Ubuntu 14, Python 2.7.6 Reporter: Chandan Kumar Priority: Minor PySpark fails if RDD operations are performed on data encapsulated in Python objects (rare use case where plain python objects are used as data containers instead of regular dict or tuples). I have written a small piece of code to reproduce the bug: https://gist.github.com/nrchandan/11394440 -- This message was sent by Atlassian JIRA (v6.2#6252)
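One way to make the example picklable, following the pickle docs linked above: keep the container class in an importable module and ship it to the workers, so the default pickler can reconstruct instances there. The file and module names below are illustrative assumptions, not part of the original gist.
{code}
# point_container.py -- hypothetical module shipped to the workers
class Point(object):
    """Plain container; picklable because it lives in an importable module."""
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

# driver script
import pyspark
from point_container import Point

sc = pyspark.SparkContext('local', 'test_app', pyFiles=['point_container.py'])
a = sc.parallelize([{'x': 1, 'y': 1}, {'x': 2, 'y': 2}, {'x': 3, 'y': 3}])
b = a.map(lambda d: Point(d['x'], d['y']))
c = b.fold(Point(), lambda p1, p2: Point(p1.x + p2.x, p1.y + p2.y))
print c.x, c.y   # 6 6
{code}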
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065066#comment-14065066 ] Matthew Farrellee commented on SPARK-2256: -- are you using a local master, mesos, yarn? for me - {code} ./dist/bin/pyspark ... [repeat this a bunch, w/ while True] sc.textFile(A_ko).take(10) sc.textFile(A_ko).take(50) ... {code} and i cannot reproduce pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
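For anyone else trying to reproduce, a concrete version of the loop described above; A_ko refers to the attached test data, and the path is an assumption.
{code}
# Repeatedly exercise take() on the attached test file to look for
# the intermittent failure reported in this ticket.
from pyspark import SparkContext

sc = SparkContext("local", "take-repro")
n = 0
while True:
    lines = sc.textFile("A_ko").take(10)   # fails intermittently per the report
    n += 1
    print n, len(lines)
{code}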
[jira] [Created] (SPARK-2558) Mention --queue argument in YARN documentation
Matei Zaharia created SPARK-2558: Summary: Mention --queue argument in YARN documentation Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type: Documentation Components: YARN Reporter: Matei Zaharia Priority: Trivial The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2558) Mention --queue argument in YARN documentation
[ https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2558: - Labels: Starter (was: ) Mention --queue argument in YARN documentation --- Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type: Documentation Components: YARN Reporter: Matei Zaharia Priority: Trivial Labels: Starter The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065121#comment-14065121 ] Aaron Davidson commented on SPARK-2282: --- This problem does look identical. I think I gave you the wrong netstat command, as -l only show listening sockets. Try with -a instead to see all open connections to confirm this, but the rest of your symptoms align perfectly. I did a little Googling around for your specific kernel version, and it turns out [someone else|http://lists.openwall.net/netdev/2011/07/13/39] has had success with tcp_tw_recycle on 2.6.32. Could you try to make absolutely sure that the sysctl is taking effect? Perhaps you can try adding net.ipv4.tcp_tw_recycle = 1 to /etc/sysctl.conf and then running a sysctl -p before restarting pyspark. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2083) Allow local task to retry after failure.
[ https://issues.apache.org/jira/browse/SPARK-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065143#comment-14065143 ] Bill Havanki commented on SPARK-2083: - Pull request available: https://github.com/apache/spark/pull/1465 (Please feel free to assign this ticket to me - I don't have that permission.) Allow local task to retry after failure. Key: SPARK-2083 URL: https://issues.apache.org/jira/browse/SPARK-2083 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Peng Cheng Priority: Trivial Labels: easyfix Original Estimate: 1h Remaining Estimate: 1h If a job is submitted to run locally using masterURL = local[X], Spark will not retry a failed task regardless of your spark.task.maxFailures setting. This design facilitates debugging and QA of Spark applications where all tasks are expected to succeed and yield results. Unfortunately, such a setting prevents a local job from finishing if any of its tasks cannot guarantee a result (e.g. one visiting an external resource/API), and retrying inside the task is less favoured (e.g. in production the task needs to be executed on a different machine). Users can still set masterURL = local[X,Y] to override this (where Y is the local maxFailures), but it is not documented and hard to manage. A quick fix would be to add a new configuration property, spark.local.maxFailures, with a default value of 1, so users know exactly what to change when reading the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
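For reference, a small sketch of the undocumented override mentioned above. The proposed spark.local.maxFailures property does not exist yet; only the local[X,Y] master string is shown.
{code}
from pyspark import SparkContext

# "local[4,3]": 4 worker threads, with task retries enabled in local mode
# (3 being the local maxFailures described in the ticket above).
sc = SparkContext("local[4,3]", "local-retry-demo")
{code}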
[jira] [Created] (SPARK-2559) Add A Link to Download the Application Events Log for Offline Analysis
Pat McDonough created SPARK-2559: Summary: Add A Link to Download the Application Events Log for Offline Analysis Key: SPARK-2559 URL: https://issues.apache.org/jira/browse/SPARK-2559 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Pat McDonough To analyze application issues offline (e.g. on another machine while supporting an end user), provide end users a link to download an archive of the application event logs. The archive can then be opened via an offline History Server. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065230#comment-14065230 ] Ángel Álvarez commented on SPARK-2256: -- I've tried using local and master spark in standalone mode. pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065240#comment-14065240 ] Davies Liu commented on SPARK-2494: --- This bug only happens in cluster mode, so it cannot be reproduced in unit tests. In cluster mode (workers on different machines), it will happen: rdd = sc.parallelize([(None, 1), (None, 2)], 2) rdd.groupByKey(2).collect() ((None, [1]), (None, [2])) The same key `None` will be put into different partitions and cannot be aggregated together. Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065249#comment-14065249 ] Matthew Farrellee commented on SPARK-2256: -- maybe there's an issue in the platform? i'm on - {code} $ head -n1 /etc/issue Fedora release 20 (Heisenbug) $ python --version Python 2.7.5 $ java -version openjdk version 1.8.0_05 OpenJDK Runtime Environment (build 1.8.0_05-b13) OpenJDK 64-Bit Server VM (build 25.5-b02, mixed mode) {code} pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065256#comment-14065256 ] Shivaram Venkataraman commented on SPARK-2316: -- I'd just like to add that in cases where we have many thousands of blocks, this stack trace occupies one core constantly on the Master and is probably one of the reasons why the WebUI stops functioning after a certain point. StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or In the case where jobs are frequently causing dropped blocks the storage status listener can bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
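To illustrate the suggested fix in a language-neutral way (a sketch of the indexing idea only, not the Scala listener code): instead of re-grouping every block by RDD id on each task-end event, which is O(blocks), keep a per-RDD aggregate that is updated incrementally as individual block statuses change.
{code}
from collections import defaultdict

# O(blocks) per event: rebuild the per-RDD totals from scratch each time.
def rebuild_rdd_totals(blocks):
    totals = defaultdict(int)
    for (rdd_id, _block_idx), mem_size in blocks.items():
        totals[rdd_id] += mem_size
    return totals

# O(1) per event: maintain the totals as an index, updated per block change.
class RddStorageIndex(object):
    def __init__(self):
        self.blocks = {}                 # (rdd_id, block_idx) -> memory size
        self.totals = defaultdict(int)   # rdd_id -> aggregated memory size

    def update_block(self, block_id, mem_size):
        rdd_id = block_id[0]
        self.totals[rdd_id] += mem_size - self.blocks.get(block_id, 0)
        if mem_size == 0:
            self.blocks.pop(block_id, None)   # block was dropped
        else:
            self.blocks[block_id] = mem_size
{code}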
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065306#comment-14065306 ] Aaron Davidson commented on SPARK-2282: --- This problem is kinda silly because we're accumulating these updates from a single thread in the DAGScheduler, so we should only really have one socket open at a time, but it's very short lived. We could just reuse the connection with a relatively minor refactor of accumulators.py and PythonAccumulatorParam. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
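A rough sketch of the connection-reuse idea floated above, purely illustrative and not the actual accumulators.py change: open the socket to the accumulator server once and reuse it for every update, instead of one socket per task completion (which is what piles up TIME_WAIT connections).
{code}
import socket

class AccumulatorUpdateSender(object):
    """Keeps one long-lived connection to a (hypothetical) accumulator
    server instead of opening a fresh socket per task completion."""
    def __init__(self, host, port):
        self._addr = (host, port)
        self._sock = None

    def _connection(self):
        if self._sock is None:
            self._sock = socket.create_connection(self._addr)
        return self._sock

    def send(self, payload):
        conn = self._connection()
        conn.sendall(payload)
        return conn.recv(1)   # wait for the server's 1-byte ack
{code}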
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Assignee: Ted Malaska Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. My first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first:
- Can it be written in Java or should it be written in Scala?
- What is the best way to add the HBase dependency? (will review how Flume does this as the first option)
- What is the best way to do testing? (will review how Flume does this as the first option)
- How to support Python? (this may be a separate JIRA; unknown at this time)
Goals:
- Simple to use
- Stable
- Supports high load
- Documented (may be in a separate JIRA; need to ask Tdas)
- Supports Java, Scala, and hopefully Python
- Supports Streaming and normal Spark
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065317#comment-14065317 ] Davies Liu commented on SPARK-2494: --- The tip of master already handles the hash of None, but it cannot handle the hash of a tuple with None in it. Here is the updated test case, sorry for that: rdd = sc.parallelize([((None, 1), 1)] * 100, 100) assert len(rdd.groupByKey(10).collect()) == 1 Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
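A minimal sketch of the kind of portable hash the description asks for: handle None and tuples explicitly so the same key lands in the same partition on every worker. It assumes that hashing of the remaining builtin types is itself consistent across the cluster (true for CPython 2.x without hash randomization); the names are illustrative.
{code}
def portable_hash(x):
    """Deterministic across CPython workers for None and tuples
    (including tuples that contain None)."""
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for item in x:
            h = (h * 1000003) ^ portable_hash(item)
        return h
    return hash(x)

# e.g. force all (None, 1) keys into one partition regardless of worker:
# rdd.partitionBy(10, partitionFunc=portable_hash)
{code}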
[jira] [Updated] (SPARK-2528) spark-ec2 security group permissions are too open
[ https://issues.apache.org/jira/browse/SPARK-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2528: Description: {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/9c73822a08848a0cde545282d3eb1c3f1a4c2a82/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. was: {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. spark-ec2 security group permissions are too open - Key: SPARK-2528 URL: https://issues.apache.org/jira/browse/SPARK-2528 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/9c73822a08848a0cde545282d3eb1c3f1a4c2a82/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. -- This message was sent by Atlassian JIRA (v6.2#6252)
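A rough sketch of the behaviour proposed in the description, using boto (which spark_ec2.py is built on); the group name, region, and port are illustrative assumptions, not values from the actual script.
{code}
import urllib2
import boto.ec2

# Detect the launching host's external IP and open the port only to it,
# instead of authorizing 0.0.0.0/0.
my_ip = urllib2.urlopen("http://icanhazip.com").read().strip()
conn = boto.ec2.connect_to_region("us-east-1")
conn.authorize_security_group(group_name="spark-cluster-master",
                              ip_protocol="tcp",
                              from_port=8080, to_port=8080,
                              cidr_ip="%s/32" % my_ip)
{code}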
[jira] [Commented] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065353#comment-14065353 ] Masayoshi TSUZUKI commented on SPARK-2501: -- Yes, this ticket covers it. I think that problem (like 1010/1000) is caused because we only use stageId (not stageId + attemptId) as the key in JobProgressListener. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
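In other words (a toy sketch of the keying problem, not the JobProgressListener code): keyed only by stageId, a resubmitted attempt overwrites or double-counts the original bookkeeping, which is how counters like 1010/1000 appear; keying by (stageId, attemptId) keeps the attempts separate.
{code}
# Keyed by stageId alone, attempt 1 clobbers attempt 0's bookkeeping:
stage_data = {}
stage_data[3] = {"attempt": 0, "tasks_done": 1000}
stage_data[3] = {"attempt": 1, "tasks_done": 10}     # resubmission overwrites

# Keyed by (stageId, attemptId), both attempts are tracked correctly:
stage_data = {}
stage_data[(3, 0)] = {"tasks_done": 1000}
stage_data[(3, 1)] = {"tasks_done": 10}
{code}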
[jira] [Created] (SPARK-2560) Create Spark SQL syntax reference
Nicholas Chammas created SPARK-2560: --- Summary: Create Spark SQL syntax reference Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2560) Create Spark SQL syntax reference
[ https://issues.apache.org/jira/browse/SPARK-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2560: Description: Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.| https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. Create Spark SQL syntax reference - Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.| https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2542) Exit Code Class should be renamed and placed package properly
[ https://issues.apache.org/jira/browse/SPARK-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065528#comment-14065528 ] Kousuke Saruta commented on SPARK-2542: --- PR: https://github.com/apache/spark/pull/1467 Exit Code Class should be renamed and placed package properly - Key: SPARK-2542 URL: https://issues.apache.org/jira/browse/SPARK-2542 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta org.apache.spark.executor.ExecutorExitCode defines a set of exit codes. The class name suggests the exit codes belong to the Executor, but the exit codes defined in the class can be used by components other than the Executor (e.g. the Driver). Actually, DiskBlockManager uses ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR, and DiskBlockManager can be used by the Driver. We should rename the class and move it to a new package. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065549#comment-14065549 ] Ken Carlile commented on SPARK-2282: Awesome. I was afraid we were trying to chase down something else here. Glad to hear that it's a known issue and that you've got a good idea how to fix it. Thanks for the quick response! --Ken PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1215. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1468 [https://github.com/apache/spark/pull/1468] Clustering: Index out of bounds error - Key: SPARK-1215 URL: https://issues.apache.org/jira/browse/SPARK-1215 Project: Spark Issue Type: Bug Components: MLlib Reporter: dewshick Assignee: Joseph K. Bradley Fix For: 1.1.0 Attachments: test.csv code: import org.apache.spark.mllib.clustering._ val test = sc.makeRDD(Array(4,4,4,4,4).map(e = Array(e.toDouble))) val kmeans = new KMeans().setK(4) kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException error: 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at KMeans.scala:243) finished in 0.047 s 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at KMeans.scala:243, took 16.389537116 s Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.simontuffs.onejar.Boot.run(Boot.java:340) at com.simontuffs.onejar.Boot.main(Boot.java:166) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47) at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247) at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.immutable.Range.foreach(Range.scala:81) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.Range.map(Range.scala:46) at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124) at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21) at Clustering$$anonfun$1.apply(Clustering.scala:19) at Clustering$$anonfun$1.apply(Clustering.scala:19) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) at scala.collection.immutable.Range.foreach(Range.scala:78) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.Range.map(Range.scala:46) at Clustering$.main(Clustering.scala:19) at Clustering.main(Clustering.scala) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2470) Fix PEP 8 violations
[ https://issues.apache.org/jira/browse/SPARK-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065655#comment-14065655 ] Reynold Xin commented on SPARK-2470: That PR only covers a small fraction of the changes required. Fix PEP 8 violations Key: SPARK-2470 URL: https://issues.apache.org/jira/browse/SPARK-2470 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Reynold Xin Assignee: Prashant Sharma Let's fix all our pep8 violations so we can turn the pep8 checker on in continuous integration. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2494) Hash of None is different cross machines in CPython
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065671#comment-14065671 ] Matthew Farrellee commented on SPARK-2494: -- thank you. i've confirmed this: {code} rdd.groupByKey(10).collect() [((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d4410), ((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d4310), ((None, 1), pyspark.resultiterable.ResultIterable object at 0x19d7290)] {code} i have 3 workers in my cluster Hash of None is different cross machines in CPython --- Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Environment: CPython 2.x Reporter: Davies Liu Priority: Blocker Labels: pyspark, shuffle Fix For: 1.0.0, 1.0.1 Original Estimate: 24h Remaining Estimate: 24h The hash of None, also tuple with None in it, is different cross machines, so the result will be wrong if None appear in the key of partitionBy(). It should use an portable hash function as the default partition function, which generate same hash for all the builtin immutable types, especially tuple. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2562) Add Date datatype support to Spark SQL
Zongheng Yang created SPARK-2562: Summary: Add Date datatype support to Spark SQL Key: SPARK-2562 URL: https://issues.apache.org/jira/browse/SPARK-2562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Priority: Minor Spark SQL currently supports Timestamp, but not Date. Hive introduced support for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], where the underlying representation is {{java.sql.Date}}. (Thanks to user Rindra Ramamonjison for reporting this.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065693#comment-14065693 ] Nicholas Chammas commented on SPARK-1458: - Perhaps that could also be some kind of unit test: Check that certain hand-inputted values are identical across Scala and Python, like the shell banner and {{sc.version}}. Expose sc.version in PySpark Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
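A sketch of what such a check might look like: a Python-side `version` property is exactly the feature this ticket requests, so both the property and the `sc._jsc.version()` accessor below are assumptions about how the JVM-side value might be read, shown only to make the unit-test idea concrete.
{code}
# Hypothetical test: the hand-maintained Python-side version string
# (e.g. what the PySpark shell banner prints) must match what the JVM reports.
PYSPARK_BANNER_VERSION = "1.1.0-SNAPSHOT"   # hand-inputted, illustrative

def test_versions_agree(sc):
    # sc._jsc.version() is assumed here as the JVM-side accessor.
    assert PYSPARK_BANNER_VERSION == sc._jsc.version()
{code}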
[jira] [Updated] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-2365: -- Attachment: 2014-07-07-IndexedRDD-design-review.pdf Slides explaining the motivation, design, and performance of IndexedRDD. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
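A toy, single-process model of the three ingredients listed in the description — hash-partitioning by key, a hash index per partition, and non-destructive updates — just to make the proposal concrete; it is not the IndexedRDD API.
{code}
class ToyIndexedStore(object):
    """Local stand-in for the IndexedRDD idea: route each key to a
    partition by hash, keep a dict index per partition."""
    def __init__(self, num_partitions=4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition(self, key):
        return hash(key) % len(self.partitions)

    def get(self, key):                      # point lookup touches one partition
        return self.partitions[self._partition(key)].get(key)

    def put(self, key, value):               # non-destructive update
        i = self._partition(key)
        updated = dict(self.partitions[i])   # a real impl would use a persistent
        updated[key] = value                 # data structure, not a full copy
        self.partitions[i] = updated

store = ToyIndexedStore()
store.put(42, "answer")
print store.get(42)   # answer
{code}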
[jira] [Comment Edited] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065694#comment-14065694 ] Ankur Dave edited comment on SPARK-2365 at 7/17/14 10:31 PM: - Added slides explaining the motivation, design, and performance of IndexedRDD. was (Author: ankurd): Slides explaining the motivation, design, and performance of IndexedRDD. Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-872) Should revive offer after tasks finish in Mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065703#comment-14065703 ] Timothy Chen commented on SPARK-872: I'm not quite understanding your statement that the Mesos master will call resourceOffer until 4 cores are free. Can you elaborate on what that means? Should revive offer after tasks finish in Mesos fine-grained mode -- Key: SPARK-872 URL: https://issues.apache.org/jira/browse/SPARK-872 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.8.0 Reporter: xiajunluan When running Spark on the latest Mesos release, I notice that Spark in Mesos fine-grained mode cannot schedule tasks effectively. For example, if a slave has 4 CPU cores, the Mesos master will call Spark's resourceOffer function until all 4 CPU cores are free. But in my view, as in standalone scheduler mode, if one task finishes and one CPU core becomes free, the Mesos master should call Spark's resourceOffer to allocate that resource to tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2454: - Description: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` and `spark.driver.home` rather than re-using SPARK_HOME everywhere. was: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` and `spark.driver.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065706#comment-14065706 ] Timothy Chen commented on SPARK-1702: - The PR is merged and closed already, is this still an issue? Mesos executor won't start because of a ClassNotFoundException -- Key: SPARK-1702 URL: https://issues.apache.org/jira/browse/SPARK-1702 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Labels: executors, mesos, spark Some discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html Fix here (which is probably not the right fix): https://github.com/apache/spark/pull/620 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again. Error in Mesos executor stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication enabled: false are ui acls enabled: false users with view permissions: Set(vagrant) 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started 14/05/02 17:31:43 INFO Remoting: Starting remoting 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@localhost:50843] 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@localhost:50843] java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176) at org.apache.spark.executor.Executor.init(Executor.scala:106) at org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56) Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] Deactivating the executor libprocess The problem is that it can't find the class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065709#comment-14065709 ] Timothy Chen commented on SPARK-1764: - I'm not sure how this is related to Mesos, is this reproable using YARN or standalone? EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shutdown. I have not been able to reliably reproduce this bug, it seems to happen randomly, but if you run enough tasks on a SparkContext it'll hapen eventually -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2563) Make number of connection retries configurable
Shivaram Venkataraman created SPARK-2563: Summary: Make number of connection retries configurable Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
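A generic sketch of what "configurable retries" amounts to; the property name below is made up for illustration only, and the real setting is whatever name the eventual patch introduces.
{code}
import time

def connect_with_retries(connect, conf):
    # "spark.shuffle.connect.maxRetries" is an illustrative name only.
    max_retries = int(conf.get("spark.shuffle.connect.maxRetries", "3"))
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except IOError:
            if attempt == max_retries:
                raise
            time.sleep(attempt)   # simple backoff before the next attempt
{code}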
[jira] [Commented] (SPARK-2563) Make number of connection retries configurable
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065735#comment-14065735 ] Shivaram Venkataraman commented on SPARK-2563: -- https://github.com/apache/spark/pull/1471 Make number of connection retries configurable -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065757#comment-14065757 ] Kousuke Saruta commented on SPARK-2491: --- Hi [~gq] I found the issue related to you reported. https://issues.apache.org/jira/browse/SPARK-1667 I know 3 situation executor does not stop properly at least. 1) At the time executor cannot write files to spark-local-* 2) At the time executor cannot fetch locally from spark-local-* 3) At the time executor cannot fetch from remote because remote executor cannot read from spark-local-* I think if those case occurred, executor should shutdown. I'm trying to solve this issue at https://github.com/apache/spark/pull/1383 When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at 
org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException:
[jira] [Resolved] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2534. Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Avoid pulling in the entire RDD or PairRDDFunctions in various operators Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0, 1.0.2 The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes: {code} def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = { // groupByKey shouldn't use map side combine because map side combine does not // reduce the amount of data shuffled and requires all map side data be inserted // into a hash table, leading to more objects in the old gen. def createCombiner(v: V) = ArrayBuffer(v) def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2 val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false) bufs.mapValues(_.toIterable) } {code} Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
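For context, here is a minimal sketch (not the actual Spark code) of why the def-based version captures the enclosing instance under Scala 2.10, while a val-based function literal does not:
{code}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for PairRDDFunctions; `ship` represents handing a
// closure to combineByKey, which serializes it into each task.
class PairFunctionsSketch[V](val bigField: Array[Byte]) extends Serializable {

  def groupWithDefs(): Unit = {
    // A nested def is lifted to a method on the enclosing class, so the
    // eta-expansion `createCombiner _` closes over `this` and drags the whole
    // instance (including bigField) into the serialized task.
    def createCombiner(v: V) = ArrayBuffer(v)
    ship(createCombiner _)
  }

  def groupWithVals(): Unit = {
    // A val holds a standalone function object with no reference to `this`,
    // so only the small function value is serialized.
    val createCombiner = (v: V) => ArrayBuffer(v)
    ship(createCombiner)
  }

  private def ship(f: V => ArrayBuffer[V]): Unit = ()
}
{code}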
[jira] [Created] (SPARK-2564) ShuffleReadMetrics.totalBlocksFetched is redundant
Sandy Ryza created SPARK-2564: - Summary: ShuffleReadMetrics.totalBlocksFetched is redundant Key: SPARK-2564 URL: https://issues.apache.org/jira/browse/SPARK-2564 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza We already track remoteBlocksFetched and localBlocksFetched -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2565) Update ShuffleReadMetrics as blocks are fetched
Sandy Ryza created SPARK-2565: - Summary: Update ShuffleReadMetrics as blocks are fetched Key: SPARK-2565 URL: https://issues.apache.org/jira/browse/SPARK-2565 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Updating ShuffleReadMetrics as a task progresses will allow reporting incremental progress after SPARK-2099. -- This message was sent by Atlassian JIRA (v6.2#6252)
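A minimal sketch of the direction, using a simplified stand-in for ShuffleReadMetrics: instead of assigning totals once when the fetch finishes, bump the counters as each block arrives so a heartbeat can report partial progress.
{code}
// Simplified stand-in; the field names mirror the real ShuffleReadMetrics.
class IncrementalShuffleReadMetrics {
  var remoteBlocksFetched: Int = 0
  var remoteBytesRead: Long = 0L

  // Called by the fetcher for every block as it arrives, rather than once at the end.
  def onBlockFetched(bytes: Long): Unit = {
    remoteBlocksFetched += 1
    remoteBytesRead += bytes
  }
}
{code}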
[jira] [Created] (SPARK-2566) Update ShuffleWriteMetrics as data is written
Sandy Ryza created SPARK-2566: - Summary: Update ShuffleWriteMetrics as data is written Key: SPARK-2566 URL: https://issues.apache.org/jira/browse/SPARK-2566 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza This will allow reporting incremental progress once we have SPARK-2099. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2564) ShuffleReadMetrics.totalBlocksFetched is redundant
[ https://issues.apache.org/jira/browse/SPARK-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065826#comment-14065826 ] Sandy Ryza commented on SPARK-2564: --- https://github.com/apache/spark/pull/1474 ShuffleReadMetrics.totalBlocksFetched is redundant -- Key: SPARK-2564 URL: https://issues.apache.org/jira/browse/SPARK-2564 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza We already track remoteBlocksFetched and localBlocksFetched -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
Masayoshi TSUZUKI created SPARK-2567: Summary: Resubmitted stage sometimes remains as active stage in the web UI Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI When a stage is resubmitted (because of an executor being lost, for example), sometimes more than one resubmitted task appears in the web UI and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-2567: - Attachment: SPARK-2567.png Resubmitted stage sometimes remains as active stage in the web UI - Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI Attachments: SPARK-2567.png When a stage is resubmitted (because of an executor being lost, for example), sometimes more than one resubmitted task appears in the web UI and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2568) RangePartitioner should go through the data only once
Reynold Xin created SPARK-2568: -- Summary: RangePartitioner should go through the data only once Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
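One way to do that (a sketch only, with illustrative names, not the eventual Spark implementation) is to reservoir-sample each partition while counting it in the same pass, so the sample and the count come out of a single scan:
{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object SinglePassSampler {
  // Returns up to k uniformly sampled elements plus the exact element count,
  // from a single traversal of the iterator (Algorithm R reservoir sampling).
  def reservoirSampleAndCount[T](iter: Iterator[T], k: Int, seed: Long = 42L): (Seq[T], Long) = {
    val rand = new Random(seed)
    val reservoir = new ArrayBuffer[T](k)
    var n = 0L
    iter.foreach { item =>
      if (n < k) {
        reservoir += item
      } else {
        // Keep the new item with probability k / (n + 1).
        val j = (rand.nextDouble() * (n + 1)).toLong
        if (j < k) reservoir(j.toInt) = item
      }
      n += 1
    }
    (reservoir.toSeq, n)
  }
}
{code}
Running this once per partition yields both the per-partition counts and the samples RangePartitioner needs to pick range bounds, without a separate count() job.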
[jira] [Resolved] (SPARK-2299) Consolidate various stageIdTo* hash maps
[ https://issues.apache.org/jira/browse/SPARK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2299. Resolution: Fixed Fix Version/s: 1.1.0 Consolidate various stageIdTo* hash maps Key: SPARK-2299 URL: https://issues.apache.org/jira/browse/SPARK-2299 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 In JobProgressListener: {code} val stageIdToTime = HashMap[Int, Long]() val stageIdToShuffleRead = HashMap[Int, Long]() val stageIdToShuffleWrite = HashMap[Int, Long]() val stageIdToMemoryBytesSpilled = HashMap[Int, Long]() val stageIdToDiskBytesSpilled = HashMap[Int, Long]() val stageIdToTasksActive = HashMap[Int, HashMap[Long, TaskInfo]]() val stageIdToTasksComplete = HashMap[Int, Int]() val stageIdToTasksFailed = HashMap[Int, Int]() val stageIdToTaskData = HashMap[Int, HashMap[Long, TaskUIData]]() val stageIdToExecutorSummaries = HashMap[Int, HashMap[String, ExecutorSummary]]() val stageIdToPool = HashMap[Int, String]() val stageIdToDescription = HashMap[Int, String]() {code} We should consolidate them to reduce memory usage and be less error prone. -- This message was sent by Atlassian JIRA (v6.2#6252)
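A sketch of the consolidation (the names are illustrative, not the actual refactoring): keep one mutable record per stage in a single map, so all per-stage state is added and evicted in one place.
{code}
import scala.collection.mutable.HashMap

// One record per stage instead of a dozen parallel stageIdTo* maps.
class StageUIData {
  var executorRunTime: Long = 0L
  var shuffleReadBytes: Long = 0L
  var shuffleWriteBytes: Long = 0L
  var memoryBytesSpilled: Long = 0L
  var diskBytesSpilled: Long = 0L
  var numCompletedTasks: Int = 0
  var numFailedTasks: Int = 0
  var poolName: String = "default"
  var description: Option[String] = None
}

class JobProgressStateSketch {
  // Single lookup: a stage is inserted and cleaned up in exactly one place.
  val stageIdToData = new HashMap[Int, StageUIData]()

  def onStageSubmitted(stageId: Int): StageUIData =
    stageIdToData.getOrElseUpdate(stageId, new StageUIData)
}
{code}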
[jira] [Created] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
jacky hung created SPARK-2569: - Summary: Customized UDFs in hive not running with Spark SQL Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Start spark-shell and initialize (create the hiveContext, import ._ etc., and make sure the jar containing the UDFs is on the classpath). hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp') is successful. Then I tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with a NullPointerException. We had a discussion about it on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
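Cleaned up, the reproduction from the description looks roughly like the following. The UDF class and table names are the reporter's; the WHERE clause in the original command appears truncated, so it is omitted here.
{code}
// In spark-shell (Spark 1.0.x), with the jar containing the UDF on the classpath.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

hql("CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'")   // succeeds

hql("SELECT t_ts(time) FROM data_common LIMIT 1")
  .collect()
  .foreach(println)                                         // throws the NullPointerException above
{code}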
[jira] [Created] (SPARK-2570) ClassCastException from HiveFromSpark(examples)
Cheng Hao created SPARK-2570: Summary: ClassCastException from HiveFromSpark(examples) Key: SPARK-2570 URL: https://issues.apache.org/jira/browse/SPARK-2570 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The exception is thrown when running the HiveFromSpark example: Exception in thread main java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45) at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2570) ClassCastException from HiveFromSpark(examples)
[ https://issues.apache.org/jira/browse/SPARK-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065905#comment-14065905 ] Cheng Hao commented on SPARK-2570: -- https://github.com/apache/spark/pull/1475 ClassCastException from HiveFromSpark(examples) --- Key: SPARK-2570 URL: https://issues.apache.org/jira/browse/SPARK-2570 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The exception is thrown when running the HiveFromSpark example: Exception in thread main java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45) at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
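The cast failure itself is the usual boxed-primitive mismatch rather than anything Hive-specific. A self-contained illustration (not the example's actual code; the value 238L is arbitrary):
{code}
import scala.util.Try

// The value comes back boxed as java.lang.Long (e.g. a Hive bigint); an
// Int-typed read forces the Long -> Integer unboxing seen in the stack trace.
val values: Array[Any] = Array(238L)
val bad  = Try(values(0).asInstanceOf[Int])   // Failure(ClassCastException), via BoxesRunTime.unboxToInt
val good = values(0).asInstanceOf[Long]       // succeeds: reads the value with its actual boxed type
{code}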
[jira] [Created] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
Kay Ousterhout created SPARK-2571: - Summary: Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-2571: -- Description: In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. was:In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
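A minimal sketch of the fix direction (illustrative names, not the actual patch): have each BlockFetcherIterator merge its numbers into the task's metrics instead of overwriting them, so a task with several shuffle dependencies reports the sum.
{code}
// Simplified stand-in for a task's shuffle read metrics.
class TaskShuffleReadMetricsSketch {
  var remoteBlocksFetched: Int = 0
  var remoteBytesRead: Long = 0L
  var fetchWaitTime: Long = 0L

  // Called once per fetcher iterator (per shuffle dependency) when it finishes.
  def merge(blocksFetched: Int, bytesRead: Long, waitTime: Long): Unit = {
    remoteBlocksFetched += blocksFetched   // accumulate rather than assign
    remoteBytesRead += bytesRead
    fetchWaitTime += waitTime
  }
}
{code}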
[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065929#comment-14065929 ] Patrick Wendell commented on SPARK-1458: Isn't it possible to just have the python function call the Java/Scala one? That way there is no new version number to be updated anywhere. Expose sc.version in PySpark Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2411: --- Assignee: Andrew Or Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href=. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2411: --- Fix Version/s: 1.1.0 Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href=. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
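For reference, the configuration such a message would point users at looks like this; the log directory below is only an example path.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.eventLog.enabled", "true")              // write event logs for finished applications
  .set("spark.eventLog.dir", "hdfs:///spark-events")  // must be readable by the standalone Master

val sc = new SparkContext(conf)
{code}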
[jira] [Updated] (SPARK-2543) Resizable serialization buffers for kryo
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2543: --- Assignee: Koert Kuipers Resizable serialization buffers for kryo Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb. For the pull request, see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
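A sketch of how the proposed setting would sit alongside the existing Kryo configuration; the values are only examples.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "2")         // initial buffer size (existing setting)
  .set("spark.kryoserializer.buffer.max.mb", "512")   // proposed ceiling the buffer may grow to
{code}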