[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117062#comment-14117062 ] Josh Rosen commented on SPARK-3333: --- [~shivaram] and I discussed this; we have a few ideas about what might be happening. I tried running {{sc.parallelize(1 to 10).repartition(24000).keyBy(x => x).reduceByKey(_+_).collect()}} in 1.0.2 and observed a similarly slow speed to what I saw in the current 1.1.0 release candidate. When I modified my job to use fewer reducers ({{reduceByKey(_+_, 4)}}), the job completed quickly. You can see similar behavior in Python by explicitly specifying a smaller number of reducers. I think the issue here is that the overhead of sending and processing task completions is O(numReducers). Specifically, the uncompressed size of ShuffleMapTask results is roughly O(numReducers), and there's an O(numReducers) processing cost for task completions within DAGScheduler (since mapOutputLocations is O(numReducers)). This normally isn't a problem, but it can impact performance for jobs with large numbers of extremely small map tasks (like this job, where nearly all of the map tasks are effectively no-ops). For larger tasks, this cost should be masked by larger overheads (such as task processing time). I'm not sure where the OOM is coming from, but the slow performance that you're observing here is probably due to the new default number of reducers (https://github.com/apache/spark/pull/1138 exposed this in Python by changing its defaults to match Scala Spark). As a result, I'm not sure that this is a regression from 1.0.2, since it behaves similarly for Scala jobs. I think we already do some compression of the task results, and there are probably other improvements that we can make to lower these overheads, but I think we should postpone that to 1.2.0. Large number of partitions causes OOM - Key: SPARK-3333 URL: https://issues.apache.org/jira/browse/SPARK-3333 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Here’s a repro for PySpark: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}.
Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread "Result resolver thread-3" java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at
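For reference, a minimal Scala sketch of the workaround Josh describes above: passing an explicit, small partition count to {{reduceByKey}} so the per-task result size and the DAGScheduler bookkeeping scale with 4 reducers rather than 24000. The values mirror the comment, and an existing SparkContext {{sc}} is assumed. {code} // Same shape as the slow job, but with 4 reduce partitions instead of the
// default (which here would match the 24000 map partitions).
val result = sc.parallelize(1 to 10)
  .repartition(24000)
  .keyBy(x => x)
  .reduceByKey(_ + _, 4) // second argument = number of reduce partitions
  .collect() {code}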
[jira] [Updated] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3330: --- Assignee: Sean Owen Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Assignee: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3330: --- Assignee: (was: Sean Owen) Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3328: --- Summary: ./make-distribution.sh --with-tachyon build is broken (was: --with-tachyon build is broken) ./make-distribution.sh --with-tachyon build is broken - Key: SPARK-3328 URL: https://issues.apache.org/jira/browse/SPARK-3328 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Elijah Epifanov cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends
[ https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117117#comment-14117117 ] Prashant Sharma commented on SPARK-3322: I remember seeing this discussion somewhere else as well, probably on a pull request. It seems the PR is not linked to this issue somehow. It would be great if you could do it. Log a ConnectionManager error when the application ends --- Key: SPARK-3322 URL: https://issues.apache.org/jira/browse/SPARK-3322 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Although it does not influence the result, it always logs an error from ConnectionManager. Sometimes it only logs "ConnectionManagerId(vm2,40992) not found", and sometimes it also logs a CancelledKeyException. The log info is as follows: 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(vm2,40992) not found 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@457245f9 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117138#comment-14117138 ] Prashant Sharma commented on SPARK-3306: I am not 100% sure why you are referring to Spark executors. I don't think there is any need to touch Spark executors to support something like that. You can probably add a new Spark conf option, and by default all Spark conf options are propagated to executors. I will let you close this issue, if you are convinced. Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access to the sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing JDBC connections established earlier in the same Spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
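For illustration, a minimal sketch (not an agreed design) of the conf-option route suggested above. The option name is hypothetical, and this assumes driver-side SparkConf entries are visible to executors via SparkEnv. {code} import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

// Hypothetical option naming an external resource; "spark.*" keys set on the
// driver are shipped to executors along with the rest of the configuration.
val conf = new SparkConf()
  .setAppName("external-resource-sketch")
  .set("spark.myapp.jdbc.url", "jdbc:postgresql://dbhost/mydb")
val sc = new SparkContext(conf)

sc.parallelize(1 to 4).foreach { _ =>
  // Executor side: read the propagated option back and, for example,
  // look up or create a pooled connection keyed by it.
  val url = SparkEnv.get.conf.get("spark.myapp.jdbc.url")
  println(s"executor would reuse a connection to $url")
} {code}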
[jira] [Updated] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3306: --- Target Version/s: (was: 1.3.0) Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data accesses to the sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing established JDBC connections from the same Spark context before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156 ] Prashant Sharma commented on SPARK-3292: This issue reminds me of a pull request that was trying to not do anything if there is nothing to save. I will try to recollect the PR number. But in the meantime, it would be great if you could link it. Shuffle Tasks run incessantly even though there's no inputs --- Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei Shuffle operations such as repartition, groupBy, join, and cogroup, for example. If I want to save the shuffle outputs as Hadoop files, many empty files are generated even though there is no input. It's too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156 ] Prashant Sharma edited comment on SPARK-3292 at 9/1/14 8:12 AM: This issue reminds me of a pull request that was trying to not do anything if there is nothing to save for streaming jobs. I will try to recollect the PR number. But in the meantime, it would be great if you could link it. was (Author: prashant_): This issue reminds me of a pull request that was trying to not do anything if there is nothing to save. I will try to recollect the PR number. But in the meantime, it would be great if you could link it. Shuffle Tasks run incessantly even though there's no inputs --- Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei Shuffle operations such as repartition, groupBy, join, and cogroup, for example. If I want to save the shuffle outputs as Hadoop files, many empty files are generated even though there is no input. It's too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
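Until something like that PR lands, a possible workaround sketch for the streaming case described in this issue: probe each batch RDD before saving, at the cost of one small extra job per batch. The stream type and path are placeholders. {code} import org.apache.spark.streaming.dstream.DStream

def saveNonEmptyBatches(stream: DStream[String], pathPrefix: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // take(1) launches a tiny job, but skips writing thousands of empty
    // part files when the batch had no input.
    if (rdd.take(1).nonEmpty) {
      rdd.saveAsTextFile(s"$pathPrefix-${time.milliseconds}")
    }
  }
} {code}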
[jira] [Created] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
kay feng created SPARK-3335: --- Summary: [Spark SQL] In pyspark, cannot use broadcast variables in UDF Key: SPARK-3335 URL: https://issues.apache.org/jira/browse/SPARK-3335 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Running pyspark on a spark cluster with standalone master, it cannot use broadcast variables in UDF. For example, bar={"a":"aa", "b":"bb", "c":"abc"} foo=sc.broadcast(bar) sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else ''). Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/root/spark/python/pyspark/worker.py", line 75, in main command = pickleSer._read_with_length(infile) File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length return self.loads(obj) File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id raise Exception("Broadcast variable '%s' not loaded!" % bid) Exception: (Exception("Broadcast variable '21' not loaded!",), <function _from_id at 0x35042a8>, (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kay feng updated SPARK-3335: Description: Running pyspark on a spark cluster with standalone master, it cannot use broadcast variables in UDF. For example, bar={"a":"aa", "b":"bb", "c":"abc"} foo=sc.broadcast(bar) sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else ''). q= sqlContext.sql('SELECT MYUDF(c) FROM foobar') out = q.collect() Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/root/spark/python/pyspark/worker.py", line 75, in main command = pickleSer._read_with_length(infile) File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length return self.loads(obj) File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id raise Exception("Broadcast variable '%s' not loaded!" % bid) Exception: (Exception("Broadcast variable '21' not loaded!",), <function _from_id at 0x35042a8>, (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
[jira] [Created] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
kay feng created SPARK-3336: --- Summary: [Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:52) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.<init>(Aggregate.scala:176) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kay feng updated SPARK-3335: Description: Running pyspark on a spark cluster with standalone master, spark sql cannot use broadcast variables in UDF. But we can use broadcast variables in Spark in Scala. For example, bar={"a":"aa", "b":"bb", "c":"abc"} foo=sc.broadcast(bar) sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else ''). q= sqlContext.sql('SELECT MYUDF(c) FROM foobar') out = q.collect() Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/root/spark/python/pyspark/worker.py", line 75, in main command = pickleSer._read_with_length(infile) File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length return self.loads(obj) File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id raise Exception("Broadcast variable '%s' not loaded!" % bid) Exception: (Exception("Broadcast variable '21' not loaded!",), <function _from_id at 0x35042a8>, (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at
[jira] [Commented] (SPARK-3282) It should support multiple receivers at one socketInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117208#comment-14117208 ] Prashant Sharma commented on SPARK-3282: You can create as many receivers as you like by creating multiple DStreams and doing a union on them. There are examples illustrating this, like: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RawNetworkGrep.scala. There you can specify how many DStreams to create, and the same number of receivers are created. It should support multiple receivers at one socketInputDStream --- Key: SPARK-3282 URL: https://issues.apache.org/jira/browse/SPARK-3282 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: shenhong At present, a socketInputDStream supports at most one receiver, which becomes a bottleneck when a large input stream appears. It should support multiple receivers at one socketInputDStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
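A minimal sketch of that pattern, with placeholder host/port values: each socketTextStream gets its own receiver, and a union merges them into one input stream. {code} import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("multi-receiver-sketch"), Seconds(1))

val numReceivers = 4 // one receiver per DStream
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
val lines = ssc.union(streams) // one logical stream backed by 4 receivers

lines.count().print()
ssc.start()
ssc.awaitTermination() {code}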
[jira] [Commented] (SPARK-3275) Socket receiver can not recover when the socket server restarted
[ https://issues.apache.org/jira/browse/SPARK-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117216#comment-14117216 ] Prashant Sharma commented on SPARK-3275: AFAIK, from a conversation with TD once: Spark Streaming has a listener interface which lets you listen for receiver-death events. So you can then manually restart/start another receiver somewhere. Socket receiver can not recover when the socket server restarted - Key: SPARK-3275 URL: https://issues.apache.org/jira/browse/SPARK-3275 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Jack Hu Labels: failover To reproduce this issue: 1. create an application with a socket dstream 2. start the socket server and start the application 3. restart the socket server 4. the socket dstream will fail to reconnect (it will close the connection after a successful connect) The main issue should be that the status in SocketReceiver and ReceiverSupervisor is incorrect after the reconnect: In SocketReceiver::receive() the while loop will never be entered after the reconnect, since isStopped returns true: val iterator = bytesToObjects(socket.getInputStream()) while(!isStopped && iterator.hasNext) { store(iterator.next) } logInfo("Stopped receiving") restart("Retrying connecting to " + host + ":" + port) That is caused by the status flag receiverState in ReceiverSupervisor being set to Stopped when the connection is lost, but only reset after the call to the receiver's start method: def startReceiver(): Unit = synchronized { try { logInfo("Starting receiver") receiver.onStart() logInfo("Called receiver onStart") onReceiverStart() receiverState = Started } catch { case t: Throwable => stop("Error starting receiver " + streamId, Some(t)) } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
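A hedged sketch of that listener approach: StreamingListener (in org.apache.spark.streaming.scheduler) exposes receiver error/stop events; the reaction (alerting, restarting a source) is application-specific and shown as a placeholder here. {code} import org.apache.spark.streaming.scheduler.{StreamingListener,
  StreamingListenerReceiverError, StreamingListenerReceiverStopped}

class ReceiverWatcher extends StreamingListener {
  override def onReceiverError(e: StreamingListenerReceiverError): Unit = {
    // Placeholder reaction: log/alert, or kick off an external restart.
    println(s"Receiver for stream ${e.receiverInfo.streamId} failed: " +
      e.receiverInfo.lastErrorMessage)
  }
  override def onReceiverStopped(s: StreamingListenerReceiverStopped): Unit = {
    println(s"Receiver for stream ${s.receiverInfo.streamId} stopped")
  }
}

// Registration: ssc.addStreamingListener(new ReceiverWatcher()) {code}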
[jira] [Commented] (SPARK-3071) Increase default driver memory
[ https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117241#comment-14117241 ] Prashant Sharma commented on SPARK-3071: This makes sense, but looking at it from a different angle: when we set the default memory too low, the user is forced to discover that it is configurable somewhere. However, if you increase the default, most things will work without OOMs, plus this takes effect cluster-wide, so they may never need to look at it. (This may be overly critical, but it is just another angle.) Increase default driver memory -- Key: SPARK-3071 URL: https://issues.apache.org/jira/browse/SPARK-3071 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng The current default is 512M, which is usually too small because the user also uses the driver to do some computation. In local mode, the executor memory setting is ignored and only driver memory is used, which provides more incentive to increase the default driver memory. I suggest: 1. 2GB in local mode, and warn users if executor memory is set to a bigger value 2. same as worker memory on an EC2 standalone server -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
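For reference, a sketch of how the default is overridden today. The caveat in the comments is my addition, not from the thread: spark.driver.memory only helps when it is read before the driver JVM starts. {code} import org.apache.spark.SparkConf

// Takes effect only if read before the driver JVM launches, e.g. from
// spark-defaults.conf or `spark-submit --driver-memory 2g`; setting it
// inside an already-running driver cannot grow that JVM's heap.
val conf = new SparkConf().set("spark.driver.memory", "2g") {code}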
[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117308#comment-14117308 ] guowei commented on SPARK-3292: --- I've created PR 2192 to fix it, but I don't know how to link it. Shuffle Tasks run incessantly even though there's no inputs --- Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei Shuffle operations such as repartition, groupBy, join, and cogroup, for example. If I want to save the shuffle outputs as Hadoop files, many empty files are generated even though there is no input. It's too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
[ https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117350#comment-14117350 ] Apache Spark commented on SPARK-3178: - User 'bbejeck' has created a pull request for this issue: https://github.com/apache/spark/pull/2227 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero Key: SPARK-3178 URL: https://issues.apache.org/jira/browse/SPARK-3178 Project: Spark Issue Type: Bug Environment: osx Reporter: Jon Haddad Assignee: Bill Bejeck Labels: starter This should either default to m or just completely fail. Starting a worker with zero memory isn't very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
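A hypothetical sketch (not the actual patch) of the validate-or-fail behavior the ticket asks for when parsing a memory string like SPARK_WORKER_MEMORY: {code} // Hypothetical helper, not Spark's Utils.memoryStringToMb: reject an
// unlabeled value instead of silently treating it as 0 MB.
def parseWorkerMemoryMb(s: String): Int = {
  val v = s.trim.toLowerCase
  if (v.endsWith("g")) v.dropRight(1).toInt * 1024
  else if (v.endsWith("m")) v.dropRight(1).toInt
  else throw new IllegalArgumentException(
    s"SPARK_WORKER_MEMORY '$s' must carry an 'm' or 'g' label, e.g. 512m or 2g")
} {code}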
[jira] [Commented] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117361#comment-14117361 ] Apache Spark commented on SPARK-3328: - User 'prudhvije' has created a pull request for this issue: https://github.com/apache/spark/pull/2228 ./make-distribution.sh --with-tachyon build is broken - Key: SPARK-3328 URL: https://issues.apache.org/jira/browse/SPARK-3328 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Elijah Epifanov cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
Prashant Sharma created SPARK-3337: -- Summary: Paranoid quoting in shell to allow install dirs with spaces within. Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3337: --- Component/s: Spark Core Paranoid quoting in shell to allow install dirs with spaces within. --- Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3337: --- Component/s: Build Paranoid quoting in shell to allow install dirs with spaces within. --- Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117381#comment-14117381 ] Apache Spark commented on SPARK-2096: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/2230 Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and return those values as an array. But, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
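For illustration, the two query shapes from the description that the parser should accept, assuming a registered table t with an arrayOfStruct column: {code} // Value of field1 for the first element of the array:
sqlContext.sql("SELECT arrayOfStruct[0].field1 FROM t")

// All field1 values across the array, returned as an array per row;
// per this issue, currently unresolvable by the sql-core parser.
sqlContext.sql("SELECT arrayOfStruct.field1 FROM t") {code}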
[jira] [Commented] (SPARK-3191) Add explanation of support building spark with maven in http proxy environment
[ https://issues.apache.org/jira/browse/SPARK-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117400#comment-14117400 ] zhengbing li commented on SPARK-3191: - Another way to solve this issue is to set an environment variable: export MAVEN_OPTS="-Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true" Add explanation of support building spark with maven in http proxy environment -- Key: SPARK-3191 URL: https://issues.apache.org/jira/browse/SPARK-3191 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.0.2 Environment: linux suse 11 maven version: apache-maven-3.0.5 spark version: 1.0.1 proxy setting of maven is: <proxy> <id>lzb</id> <active>true</active> <protocol>http</protocol> <username>user</username> <password>password</password> <host>proxy.company.com</host> <port>8080</port> <nonProxyHosts>*.company.com</nonProxyHosts> </proxy> Reporter: zhengbing li Priority: Trivial Labels: build, maven Fix For: 1.1.0 Original Estimate: 1h Remaining Estimate: 1h When I use mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package in an http proxy environment, I cannot finish this task. The error is as follows: [INFO] Spark Project YARN Stable API . SUCCESS [34.217s] [INFO] Spark Project Assembly FAILURE [43.133s] [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 27:57.309s [INFO] Finished at: Sat Aug 23 09:43:21 CST 2014 [INFO] Final Memory: 51M/1080M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Execution default of goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade failed: Plugin org.apache.maven.plugins:maven-shade-plugin:2.2 or one of its dependencies could not be resolved: Could not find artifact com.google.code.findbugs:jsr305:jar:1.3.9 - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-assembly_2.10 If you use this command, it is OK: mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -DskipTests clean package The error is not very obvious; I spent a long time solving this issue. In order to help others who use Spark in an http proxy environment, I highly recommend adding this to the documents -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453 ] Guoqiang Li commented on SPARK-1405: Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453 ] Guoqiang Li edited comment on SPARK-1405 at 9/1/14 2:25 PM: Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is OK to review. was (Author: gq): Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117464#comment-14117464 ] Ravindra Pesala commented on SPARK-2594: Thank you Michael. Following are the tasks which I am planning to do to support this feature: 1. Change the SqlParser to support the ADD CACHE TABLE syntax. We can also change the HiveQl to support the syntax for Hive. 2. Add a new strategy 'AddCacheTable' to 'SparkStrategies'. 3. In the new strategy 'AddCacheTable', register the tableName with the plan and cache it under tableName. Please review and comment. Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
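For illustration, a sketch of the intended end-user behavior once this lands; the table and query names are placeholders: {code} // One statement registers and caches the query result under a name...
sqlContext.sql("CACHE TABLE recentEvents AS SELECT * FROM events WHERE ts > 100")

// ...which later queries then hit in memory.
val counts = sqlContext.sql("SELECT COUNT(*) FROM recentEvents").collect() {code}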
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558 ] Ye Xianjin commented on SPARK-3098: --- Hi [~srowen] and [~gq], I think what [~matei] wants to say is that because the ordering of elements in distinct() is not guaranteed, the result of zipWithIndex is not deterministic. If you recompute the RDD with the distinct transformation, you are not guaranteed to get the same result. That explains the behavior here. But as [~srowen] said, it's surprising to see different results from the same RDD. [~matei], what do you think about this behavior? In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117581#comment-14117581 ] Timothy Chen commented on SPARK-2022: - This should be resolved now, [~pwend...@gmail.com] please help close this. Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread "main" java.lang.NumberFormatException: For input string: "sparkseq003.cloudapp.net" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseGrainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622 ] Matei Zaharia commented on SPARK-3098: -- It's true that the ordering of values after a shuffle is nondeterministic, so that for example on failure you might get a different order of keys in a reduceByKey or distinct or operations like that. However, I think that's the way it should be (and we can document it). RDDs are deterministic when viewed as a multiset, but not when viewed as an ordered collection, unless you do sortByKey. Operations like zipWithIndex are meant to be more of a convenience to get unique IDs or act on something with a known ordering (such as a text file where you want to know the line numbers). But the freedom to control fetch ordering is quite important for performance, especially if you want to have a push-based shuffle in the future. If we wanted to get the same result every time, we could design reduce tasks to tell the master the order they fetched stuff in after the first time they ran, but even then, notice that it might limit the kind of shuffle mechanisms we allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd rather not make that guarantee now. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
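Given that contract (deterministic as a multiset, not as an ordered collection), a workaround sketch for the original repro: impose a total order with sortBy before zipWithIndex, so recomputations assign the same index to the same element, at the cost of an extra shuffle. An existing SparkContext {{sc}} is assumed. {code} val c = sc.parallelize(1 to 7899)
  .flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }
  .distinct()
  .sortBy(identity) // fix a deterministic order before indexing
  .zipWithIndex()

// The self-join consistency check from the report should now return no rows.
c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code}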
[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods
[ https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2312: -- Assignee: Josh Rosen Spark Actors do not handle unknown messages in their receive methods Key: SPARK-2312 URL: https://issues.apache.org/jira/browse/SPARK-2312 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kam Kasravi Assignee: Josh Rosen Priority: Minor Labels: starter Fix For: 1.1.0 Original Estimate: 24h Remaining Estimate: 24h Per akka documentation - an actor should provide a pattern match for all messages including _, otherwise akka.actor.UnhandledMessage will be propagated. Noted actors: MapOutputTrackerMasterActor, ClientActor, Master, Worker... Should minimally do a logWarning(s"Received unexpected actor system event: $_") so message info is logged in the correct actor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
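A minimal sketch of the requested pattern, assuming Spark's Logging trait for logWarning; the trailing case keeps Akka from routing stray messages to akka.actor.UnhandledMessage: {code} import akka.actor.Actor
import org.apache.spark.Logging

class ExampleTrackerActor extends Actor with Logging {
  def receive = {
    case "known-message" =>
      sender ! "ok"
    case unknown => // catch-all instead of letting Akka publish UnhandledMessage
      logWarning(s"Received unexpected actor system event: $unknown")
  }
} {code}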
[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods
[ https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2312: Assignee: Isaias Barroso (was: Josh Rosen)
[jira] [Created] (SPARK-3338) Respect user setting of spark.submit.pyFiles
Andrew Or created SPARK-3338: Summary: Respect user setting of spark.submit.pyFiles Key: SPARK-3338 URL: https://issues.apache.org/jira/browse/SPARK-3338 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or

We currently override any setting of spark.submit.pyFiles. Even though this setting is not documented, we should still respect it if the user explicitly sets it in his/her default properties file.
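A hedged sketch of the intended precedence, with a plain Map standing in for SparkConf and {{computedPyFiles}} standing in for whatever spark-submit would otherwise derive:
{code}
def resolvePyFiles(conf: Map[String, String], computedPyFiles: String): Map[String, String] =
  if (conf.contains("spark.submit.pyFiles")) conf // user set it explicitly: respect it
  else conf + ("spark.submit.pyFiles" -> computedPyFiles) // otherwise fill in the derived value
{code}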
[jira] [Updated] (SPARK-3338) Respect user setting of spark.submit.pyFiles
[ https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3338: Component/s: (was: PySpark)
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117770#comment-14117770 ] Patrick Wendell commented on SPARK-3333: [~nchammas] I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? What happens if you run the code in Spark 1.1.0 and specify a number of reducers that matches the number chosen in 1.0.2? Is the performance the same? If this is the case, we should probably document it clearly in the release notes, since changing this default could have major implications for those relying on the default behavior.
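A Scala sketch of the experiment Patrick is asking for (the reducer count 8 is a placeholder; the real value would be read off the 1.0.2 job's shuffle stage in the UI):
{code}
val reducersFrom102 = 8 // hypothetical: whatever the 1.0.2 run actually used
sc.parallelize(Seq("Nick", "John", "Bob"))
  .repartition(24000)
  .keyBy(_.length)
  .reduceByKey(_ + _, reducersFrom102) // pin the shuffle's partition count explicitly
  .take(1)
{code}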
[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117770#comment-14117770 ] Patrick Wendell edited comment on SPARK-3333 at 9/1/14 10:51 PM:

[~nchammas] I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? What happens if you run the code in Spark 1.1.0 and specify a number of reducers that matches the number chosen in 1.0.2? Is the performance the same? If this is the case, we should probably document it clearly in the release notes, since changing this default could have major implications for those relying on the default behavior.

was (Author: pwendell): [~nchammas] I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? How many are there? What happens if you run the code in Spark 1.1.0 and specify a number of reducers that matches the number chosen in 1.0.2? Is the performance the same? If this is the case, we should probably document it clearly in the release notes, since changing this default could have major implications for those relying on the default behavior.
[jira] [Created] (SPARK-3339) Support for skipping json lines that fail to parse
Michael Armbrust created SPARK-3339: Summary: Support for skipping json lines that fail to parse Key: SPARK-3339 URL: https://issues.apache.org/jira/browse/SPARK-3339 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust

When dealing with large datasets there is always some data that fails to parse. It would be nice to handle this instead of throwing an exception and requiring the user to filter it out manually.
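Until this is built in, a hedged sketch of the manual workaround ({{parseLine}} is a hypothetical stand-in for a real per-line JSON parser):
{code}
import scala.util.Try

def parseLine(line: String): Map[String, Any] =
  sys.error("placeholder for a real JSON parser") // hypothetical

// keep only the lines that parse; failures are silently dropped
val rows = sc.textFile("hdfs:///data/events.json")
  .flatMap(line => Try(parseLine(line)).toOption)
{code}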
[jira] [Updated] (SPARK-3339) Support for skipping json lines that fail to parse
[ https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3339: Assignee: Yin Huai
[jira] [Updated] (SPARK-3308) Ability to read JSON Arrays as tables
[ https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3308: Assignee: Yin Huai

Ability to read JSON Arrays as tables
Key: SPARK-3308 URL: https://issues.apache.org/jira/browse/SPARK-3308 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical

Right now we can only read JSON where each object is on its own line. It would be nice to be able to read top-level JSON arrays where each element maps to a row.
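A hedged sketch of one way this could work ({{parseTopLevelArray}} is hypothetical): read each file whole, parse the top-level array, and emit one row per element.
{code}
def parseTopLevelArray(doc: String): Seq[String] =
  sys.error("placeholder: split a JSON array document into element strings") // hypothetical

// wholeTextFiles yields (path, contents) pairs, one per file
val rows = sc.wholeTextFiles("hdfs:///data/arrays/")
  .flatMap { case (_, contents) => parseTopLevelArray(contents) }
{code}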
[jira] [Commented] (SPARK-3319) Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.
[ https://issues.apache.org/jira/browse/SPARK-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117775#comment-14117775 ] Apache Spark commented on SPARK-3319: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2232

Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.
Key: SPARK-3319 URL: https://issues.apache.org/jira/browse/SPARK-3319 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or

We already do this for --jars, --files, and --py-files etc. For consistency, we should do the same for the corresponding Spark configs as well, in case the user sets them in spark-defaults.conf instead.
[jira] [Created] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
Andrew Or created SPARK-3340: Summary: Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or

These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables.
[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles
[ https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117801#comment-14117801 ] Andrew Or commented on SPARK-3338: https://github.com/apache/spark/pull/2232
[jira] [Assigned] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-2638: Assignee: Josh Rosen

Improve concurrency of fetching Map outputs
Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Assignee: Josh Rosen Priority: Minor Labels: MapOutput, concurrency Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h

This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing fetching collection, which makes ALL fetches wait if any fetch is occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM-wide visibility).
{code}
def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
  val statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
    var fetchedStatuses: Array[MapStatus] = null
    fetching.synchronized {                      // This is existing code
    // shuffleId.toString.intern.synchronized {  // New Code
      if (fetching.contains(shuffleId)) {
        // Someone else is fetching it; wait for them to be done
        while (fetching.contains(shuffleId)) {
          try {
            fetching.wait()
          } catch {
            case e: InterruptedException =>
          }
        }
{code}
This is only a small code change, but the test cases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a test case to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach. The GitHub project is at https://github.com/javadba/scalatesting.git . Simply run {{sbt test}}. Note: it is unclear how/where to include this ancillary testing/verification information that will not be included in the git PR; I am open to any suggestions, even as far as simply removing references to it.
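As an aside, a common alternative to interning the shuffleId string is a map of per-shuffle lock objects, sketched here (not Spark code):
{code}
object ShuffleLocks {
  private val locks = new java.util.concurrent.ConcurrentHashMap[Int, AnyRef]()

  // one lock object per shuffleId; putIfAbsent keeps creation race-free
  def lockFor(shuffleId: Int): AnyRef = {
    val fresh = new AnyRef
    val existing = locks.putIfAbsent(shuffleId, fresh)
    if (existing != null) existing else fresh
  }
}
// usage: ShuffleLocks.lockFor(shuffleId).synchronized { /* fetch for this shuffle only */ }
{code}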
[jira] [Resolved] (SPARK-1174) Adding port configuration for HttpFileServer
[ https://issues.apache.org/jira/browse/SPARK-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1174. Resolution: Duplicate

Adding port configuration for HttpFileServer
Key: SPARK-1174 URL: https://issues.apache.org/jira/browse/SPARK-1174 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Egor Pakhomov Assignee: Egor Pahomov Priority: Minor Fix For: 0.9.0

I run Spark in a big organization, where opening a port accessible to other computers on the network requires a DevOps ticket that takes days to process. I can't have the port for some Spark service change all the time; I need the ability to configure this port.
[jira] [Updated] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API
[ https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2895: Description: This is a requirement from Hive on Spark: mapPartitionsWithContext only exists in the Spark Scala API, and we expect to access it from the Spark Java API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in the mapPartitions closure need to get the taskId.

was: This is a requirement from Hive on Spark, mapPartitionsWithContext only exists in Spark Scala API, we expect to access from Spark Java API.

Support mapPartitionsWithContext in Spark Java API
Key: SPARK-2895 URL: https://issues.apache.org/jira/browse/SPARK-2895 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: hive

This is a requirement from Hive on Spark: mapPartitionsWithContext only exists in the Spark Scala API, and we expect to access it from the Spark Java API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in the mapPartitions closure need to get the taskId.
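For reference, a sketch of the Scala-side API the request wants mirrored in Java (assuming a 1.x build where mapPartitionsWithContext is available and a spark-shell {{sc}}):
{code}
import org.apache.spark.TaskContext

// tag each element with the id of the partition (task) that processed it
val withIds = sc.parallelize(1 to 100, 4).mapPartitionsWithContext(
  (ctx: TaskContext, it: Iterator[Int]) => it.map(x => (ctx.partitionId, x)),
  preservesPartitioning = true)
{code}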
[jira] [Resolved] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2078. Resolution: Won't Fix

Discussion on the JIRA suggested either keeping the current format or just adding milliseconds. https://github.com/apache/spark/pull/1018

Use ISO8601 date formats in logging
Key: SPARK-2078 URL: https://issues.apache.org/jira/browse/SPARK-2078 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash

Currently, logging has 2-digit years and doesn't include milliseconds in logging timestamps. Use ISO8601 date formats instead of the current custom formats. There is some precedent here for the ISO8601 format: it's what [Hadoop uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties].
[jira] [Updated] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2078: Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-3179) Add task OutputMetrics
[ https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117825#comment-14117825 ] Sandy Ryza commented on SPARK-3179: Hi Michael, happy to help review your code or answer questions.

Add task OutputMetrics
Key: SPARK-3179 URL: https://issues.apache.org/jira/browse/SPARK-3179 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza

Track the bytes that tasks write to HDFS or other output destinations.
[jira] [Resolved] (SPARK-2773) Shuffle: use growth rate to predict whether to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2773. Resolution: Invalid

Shuffle: use growth rate to predict whether to spill
Key: SPARK-2773 URL: https://issues.apache.org/jira/browse/SPARK-2773 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 0.9.0, 1.0.0 Reporter: uncleGen Priority: Minor

Right now, Spark uses the total shuffle-memory usage of each thread to predict whether it needs to spill. I don't think that is very reasonable. For example, suppose there are two threads pulling shuffle data, and the total memory available for buffering is 21G. The first spill is triggered when one thread has used 7G of memory to buffer shuffle data (assume the other thread has used the same amount). Unfortunately, 7G then still remains unused, so the current prediction mode is too conservative and cannot maximize the use of shuffle memory. In my solution, I use the growth rate of shuffle memory instead. The growth per check is limited, maybe 10K * 1024 bytes (my assumption, i.e. about 10M); then a spill is first triggered when the remaining shuffle memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think this maximizes the use of shuffle memory. My solution also makes a conservative assumption, namely that all threads in one executor are pulling shuffle data, but this does not have much effect, since the growth is limited after all. Any suggestions?
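A hedged sketch of the proposed heuristic, using the reporter's own numbers (the names and the 10M growth bound come from the description's assumptions, not from Spark code):
{code}
val maxGrowthPerCheck = 10L * 1024 * 1024 // the "10K * 1024" above, about 10M

// spill once the remaining shuffle memory could be exhausted within
// roughly two more checks across all active threads
def shouldSpill(remainingShuffleMemory: Long, activeThreads: Int): Boolean =
  remainingShuffleMemory < activeThreads.toLong * maxGrowthPerCheck * 2
{code}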
[jira] [Resolved] (SPARK-2773) Shuffle: use growth rate to predict whether to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2773. Resolution: Won't Fix

I don't think this is needed now that SPARK-2316 is fixed. This queue is not intended to overflow during normal operation. If you still observe issues in a version of Spark that contains SPARK-2316... please report it and we'll see what is going on.
[jira] [Reopened] (SPARK-2773) Shuffle: use growth rate to predict whether to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2773.
[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes
[ https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3067: Target Version/s: 1.2.0

JobProgressPage could not show Fair Scheduler Pools section sometimes
Key: SPARK-3067 URL: https://issues.apache.org/jira/browse/SPARK-3067 Project: Spark Issue Type: Bug Components: Spark Core Reporter: YanTang Zhai Priority: Minor

JobProgressPage sometimes fails to show the Fair Scheduler Pools section. SparkContext starts the webui and then calls postEnvironmentUpdate. If JobProgressPage is accessed between the webui starting and postEnvironmentUpdate, the lazy val isFairScheduler evaluates to false and is cached, so the Fair Scheduler Pools section never displays again.
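An illustrative Scala sketch of the race (names hypothetical; {{readScheduler}} stands in for the environment state that postEnvironmentUpdate fills in):
{code}
class PoolsSection(readScheduler: () => String) {
  lazy val isFairScheduler: Boolean = readScheduler() == "FAIR" // cached on first access
}

val section = new PoolsSection(() => sys.props.getOrElse("demo.scheduler", "FIFO"))
section.isFairScheduler              // page hit before the update: caches false
sys.props("demo.scheduler") = "FAIR" // plays the role of postEnvironmentUpdate
section.isFairScheduler              // still false: the Pools section never shows
{code}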
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117853#comment-14117853 ] Nicholas Chammas commented on SPARK-3333: It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling for this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code:
{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, sc.defaultParallelism).take(1)
{code}
Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
|95s|343s|
|89s|336s|
|95s|334s|
So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result?
[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117853#comment-14117853 ] Nicholas Chammas edited comment on SPARK-3333 at 9/2/14 3:13 AM:

It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling for this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code:
{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, sc.defaultParallelism).take(1)
{code}
Here are the runtimes for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
|95s|343s|
|89s|336s|
|95s|334s|
So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result?

was (Author: nchammas): It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling for this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code:
{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, sc.defaultParallelism).take(1)
{code}
Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
|95s|343s|
|89s|336s|
|95s|334s|
So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result?
[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager
[ https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117855#comment-14117855 ] Reynold Xin commented on SPARK-3168: [~tgraves] - can you take a look? I think you added the ui filter stuff. Thanks.

The ServletContextHandler of webui lacks a SessionManager
Key: SPARK-3168 URL: https://issues.apache.org/jira/browse/SPARK-3168 Project: Spark Issue Type: Bug Components: Spark Core Environment: CAS Reporter: meiyoula

When I use CAS to implement single sign-on for the webui, an exception occurs:
{code}
WARN [qtp1076146544-24] / org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
java.lang.IllegalStateException: No SessionManager
        at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
        at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
        at org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:744)
{code}
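A hedged sketch of one possible fix, assuming Jetty's standard API (untested against Spark's ui code): construct the handler with sessions enabled so request.getSession stops throwing.
{code}
import org.eclipse.jetty.servlet.ServletContextHandler

// the SESSIONS flag attaches a SessionHandler to the context
val handler = new ServletContextHandler(ServletContextHandler.SESSIONS)
{code}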
[jira] [Resolved] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3135. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Reynold Xin

Avoid memory copy in TorrentBroadcast serialization
Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin Labels: starter Fix For: 1.2.0

TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize the broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
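A minimal sketch of the idea (not the committed implementation): an OutputStream that appends into fixed-size chunks so no single giant array is ever materialized.
{code}
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

class ChunkedByteOutputStream(chunkSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var pos = chunkSize // forces a chunk allocation on the first write

  override def write(b: Int): Unit = {
    if (pos == chunkSize) { chunks += new Array[Byte](chunkSize); pos = 0 }
    chunks.last(pos) = b.toByte
    pos += 1
  }

  // returns the chunks plus how many bytes of the final chunk are valid
  def toChunks: (Array[Array[Byte]], Int) = (chunks.toArray, pos)
}
{code}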
[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3333: Target Version/s: 1.1.0
[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3333: Priority: Blocker (was: Major)
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117905#comment-14117905 ] Josh Rosen commented on SPARK-3333: --- Still investigating. I tried this on my laptop by running the following script through spark-submit:
{code}
from pyspark import SparkContext

sc = SparkContext(appName="test")
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(1)
parallelism = 4
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, parallelism).take(1)
{code}
With spark-1.0.2-bin-hadoop1, this ran in ~46 seconds, while branch-1.1 took ~30 seconds. Spark 1.0.2 seemed to experience one long pause that might have been due to GC, but I'll have to measure that to confirm. Both runs completed without crashing. I'll see what happens if I bump up the number of partitions. Large number of partitions causes OOM - Key: SPARK-3333 URL: https://issues.apache.org/jira/browse/SPARK-3333 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here's a repro for PySpark:
{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
{code}
This code runs fine on 1.0.2. It returns the following result in just over a minute:
{code}
[(4, 'NickJohn')]
{code}
However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2:
{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms
14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms
14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms
14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms
14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms
14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms
14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3
java.lang.OutOfMemoryError: Java heap space
	at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
	at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
	at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
	at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
	at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
	at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
	at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
{code}
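To see how the cost grows with partition count, a quick spark-shell experiment along the following lines could time the same tiny job at increasing sizes. This is a hypothetical sketch, not from the ticket; it assumes an existing {{sc}} (as in spark-shell, which also pre-imports the pair-RDD implicits) and relies on single-run wall-clock timings, which are noisy:
{code}
// Hypothetical spark-shell sketch: time one tiny job at increasing partition counts.
// Assumes an existing SparkContext `sc`; single wall-clock runs are noisy.
for (numPartitions <- Seq(100, 1000, 10000, 24000)) {
  val start = System.nanoTime()
  sc.parallelize(Seq("Nick", "John", "Bob"))
    .repartition(numPartitions)
    .keyBy(_.length)
    .reduceByKey(_ + _)
    .take(1)
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(f"$numPartitions%6d partitions: $elapsedSec%6.1f s")
}
{code}
Since the job itself is a near no-op at every size, any growth in runtime here would come from scheduling and bookkeeping overhead rather than from the actual work.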
[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117909#comment-14117909 ] Patrick Wendell commented on SPARK-3005: I think it might be safe to just define the semantics of cancelTask to be best effort and have the default implementation be empty. From what I can tell, the DAGScheduler is already resilient to a task finishing after the stage has been cleaned up (for other reasons, I think it needs to be resilient to this). So maybe we should just remove the entire ableToCancelStages logic here and make it simpler. /cc [~kayousterhout] and [~markhamstra], who worked on this code. Hey [~xuzhongxing], what do you mean when you say the tasks themselves already died and exited? The code here is designed to cancel outstanding tasks that are still running. For instance, suppose I have a job with 500 tasks running. Then one task fails multiple times, so I need to fail the stage, but there are still many tasks running. I think those tasks still need to be killed; otherwise you have zombie tasks running. I think in Mesos fine-grained mode we just won't be able to support this feature... but we should make sure it doesn't hang. Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, and spark-cassandra-connector to do some work on a Cassandra cluster. While the job was running, I killed the Cassandra daemon to simulate some failure cases, which resulted in task failures. If I run the job in Mesos coarse-grained mode, the Spark driver program throws an exception and shuts down cleanly. But when I run the job in Mesos fine-grained mode, the Spark driver program hangs.
The Spark log is:
{code}
INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1
INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1
java.lang.UnsupportedOperationException
	at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
	at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at
{code}
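To make the best-effort semantics concrete, a minimal sketch (against a simplified trait, not Spark's actual SchedulerBackend) might give killTask an empty default implementation, so backends that cannot kill tasks, like Mesos fine-grained mode, inherit a no-op instead of throwing:
{code}
// Hypothetical sketch of best-effort killTask, using a simplified trait.
// Backends that cannot kill tasks inherit the no-op instead of throwing
// UnsupportedOperationException, so stage cancellation cannot blow up on them.
trait SchedulerBackend {
  def reviveOffers(): Unit

  // Best effort: the default implementation silently ignores the request.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {}
}

class MesosFineGrainedBackendSketch extends SchedulerBackend {
  override def reviveOffers(): Unit = { /* request resource offers from Mesos */ }
  // No killTask override: fine-grained mode cannot kill individual tasks,
  // so the inherited no-op applies and cancelTasks proceeds without error.
}
{code}
The trade-off is the one discussed above: tasks that cannot be killed keep running as zombies until they finish on their own, so best effort trades some resource waste for not hanging the driver.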
[jira] [Updated] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3005: --- Target Version/s: 1.1.1, 1.2.0
[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-1239: -- Summary: Don't fetch all map output statuses at each reducer during shuffles (was: Don't fetch all map outputs at each reducer during shuffles) Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.1.0 Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task.
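As a rough illustration of the first option, the status fetch could be parameterized by reduce partition, so each reducer pulls only its own slice: O(numMappers) per fetch instead of the full O(numMappers * numReducers) table. The names and shapes below are hypothetical stand-ins, not Spark's actual MapOutputTracker API:
{code}
// Hypothetical sketch: fetch map output locations for a single reducer
// rather than shipping the full status table to every reduce task.
// BlockManagerId and the tracker wiring are simplified stand-ins.
case class BlockManagerId(host: String, port: Int)
case class MapStatus(location: BlockManagerId, sizesByReducer: Array[Long])

class MapOutputTrackerSketch(statuses: Map[Int, Array[MapStatus]]) {
  // For one (shuffleId, reduceId), return each mapper's location plus the size
  // of just that reducer's block, so a fetch is O(numMappers) rather than
  // O(numMappers * numReducers).
  def getStatusesForReducer(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] =
    statuses(shuffleId).map(s => (s.location, s.sizesByReducer(reduceId)))
}
{code}
The second option mentioned in the description (piggybacking the statuses on each task) would avoid the round-trip to the tracker entirely, presumably at the cost of larger task descriptions.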
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117914#comment-14117914 ] Patrick Wendell commented on SPARK-1239: I think [~andrewor] has this in his backlog, but it's not actively being worked on. [~bcwalrus], do you or [~sandyr] want to take a crack at it?
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117917#comment-14117917 ] Nicholas Chammas commented on SPARK-1701: - OK, so it sounds like the action is:
* Replace all occurrences of slice with partition
* Leave occurrences of task as-is
[~darabos] - Feel free to run with this. Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code, slice and partition are used interchangeably (or so it seems to me). It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated than a search-and-replace. I can take a stab at it, if you agree.