[jira] [Resolved] (SPARK-2636) Expose job ID in JobWaiter API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-2636.
    Resolution: Fixed
    Fix Version/s: 1.2.0

> Expose job ID in JobWaiter API
> ------------------------------
>
> Key: SPARK-2636
> URL: https://issues.apache.org/jira/browse/SPARK-2636
> Project: Spark
> Issue Type: New Feature
> Components: Java API
> Reporter: Chengxiang Li
> Assignee: Chengxiang Li
> Labels: hive
> Fix For: 1.2.0
>
> In Hive on Spark, we want to track Spark job status through the Spark API. The basic idea is as follows:
> # Create a Hive-specific Spark listener and register it with the Spark listener bus.
> # The Hive-specific Spark listener generates job status from Spark listener events.
> # The Hive driver tracks job status through the Hive-specific Spark listener.
> The current problem is that the Hive driver needs a job identifier to track a specific job's status through the Spark listener, but there is no Spark API that returns a job identifier (like a job ID) when submitting a Spark job.
> I think any other project that tries to track job status with the Spark API would suffer from this as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
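The tracking scheme described in this ticket — a listener registered on a bus, with job status keyed by job ID — can be illustrated with a minimal Python simulation. The names `JobListenerBus` and `JobStatusListener` are hypothetical stand-ins for illustration, not Spark API:

```python
class JobListenerBus:
    """Dispatches job events to every registered listener (illustrative)."""

    def __init__(self):
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def post(self, event):
        for listener in self._listeners:
            listener.on_event(event)


class JobStatusListener:
    """Tracks per-job status by job ID, as a Hive driver would."""

    def __init__(self):
        self.status = {}

    def on_event(self, event):
        self.status[event["job_id"]] = event["state"]


bus = JobListenerBus()
listener = JobStatusListener()
bus.add_listener(listener)

# Without an exposed job ID (the gap this ticket fixes), the driver could
# not correlate these events with the job it just submitted.
bus.post({"job_id": 42, "state": "RUNNING"})
bus.post({"job_id": 42, "state": "SUCCEEDED"})
print(listener.status[42])  # SUCCEEDED
```

The key point is the correlation step: the listener can only be useful to the submitter if the submission API hands back the same job ID that later appears in events.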
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengxiang Li updated SPARK-2633:
    Attachment: (was: Spark listener enhancement for Hive on Spark job monitor and statistic.docx)

> enhance spark listener API to gather more spark job information
> ---------------------------------------------------------------
>
> Key: SPARK-2633
> URL: https://issues.apache.org/jira/browse/SPARK-2633
> Project: Spark
> Issue Type: New Feature
> Components: Java API
> Reporter: Chengxiang Li
> Priority: Critical
> Labels: hive
> Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx
>
> Based on the Hive on Spark job status monitoring and statistic collection requirements, enhance the Spark listener API to gather more Spark job information.
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengxiang Li updated SPARK-2633:
    Attachment: Spark listener enhancement for Hive on Spark job monitor and statistic.docx

> enhance spark listener API to gather more spark job information
> ---------------------------------------------------------------
>
> Key: SPARK-2633
> URL: https://issues.apache.org/jira/browse/SPARK-2633
> Project: Spark
> Issue Type: New Feature
> Components: Java API
> Reporter: Chengxiang Li
> Priority: Critical
> Labels: hive
> Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx
>
> Based on the Hive on Spark job status monitoring and statistic collection requirements, enhance the Spark listener API to gather more Spark job information.
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117918#comment-14117918 ]

Patrick Wendell commented on SPARK-1701:

I think that's a straw man. A closer review once there is a patch might reveal some corner cases.

On Mon, Sep 1, 2014 at 10:59 PM, Nicholas Chammas (JIRA)

> Inconsistent naming: "slice" or "partition"
> -------------------------------------------
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Spark Core
> Reporter: Daniel Darabos
> Priority: Minor
> Labels: starter
>
> Throughout the documentation and code, "slice" and "partition" are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think "partition" is winning, since that is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can take a stab at it, if you agree.
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117917#comment-14117917 ]

Nicholas Chammas commented on SPARK-1701:

OK, so it sounds like the action is:
* Replace all occurrences of "slice" with "partition"
* Leave occurrences of "task" as-is

[~darabos] - Feel free to run with this.

> Inconsistent naming: "slice" or "partition"
> -------------------------------------------
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Spark Core
> Reporter: Daniel Darabos
> Priority: Minor
> Labels: starter
>
> Throughout the documentation and code, "slice" and "partition" are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think "partition" is winning, since that is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can take a stab at it, if you agree.
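The agreed rename is mechanical enough to sketch. `rename_slices` below is a hypothetical helper, not part of any patch, illustrating a word-boundary search & replace that renames "slice"/"slices" while leaving "task" untouched:

```python
import re


def rename_slices(text):
    """Replace whole-word 'slice'/'slices' with 'partition'/'partitions'.

    Word boundaries keep unrelated words (and 'task') untouched.
    """
    text = re.sub(r"\bslices\b", "partitions", text)
    return re.sub(r"\bslice\b", "partition", text)


doc = "Each task processes one slice; an RDD has many slices."
print(rename_slices(doc))
# Each task processes one partition; an RDD has many partitions.
```

The real change would run a pass like this over the docs and source tree, plus manual review of public API names (e.g. `numSlices` parameters) that cannot be renamed without breaking compatibility.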
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117914#comment-14117914 ]

Patrick Wendell commented on SPARK-1239:

I think [~andrewor] has this in his backlog, but it's not actively being worked on. [~bcwalrus], do you or [~sandyr] want to take a crack at it?

> Don't fetch all map output statuses at each reducer during shuffles
> -------------------------------------------------------------------
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Andrew Or
> Fix For: 1.1.0
>
> Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task.
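The proposal — serve each reducer only the statuses it needs instead of shipping the full M x R table — can be sketched with a toy tracker. The names below are illustrative, not Spark's actual MapOutputTracker API:

```python
class ToyMapOutputTracker:
    """Holds one list of per-reducer output sizes for each map task."""

    def __init__(self):
        # map_id -> list of output sizes, one entry per reducer
        self.statuses = {}

    def register(self, map_id, sizes_by_reducer):
        self.statuses[map_id] = sizes_by_reducer

    def get_statuses_for_reducer(self, reduce_id):
        """Return only the (map_id, size) pairs one reducer needs."""
        return {m: sizes[reduce_id] for m, sizes in self.statuses.items()}


tracker = ToyMapOutputTracker()
tracker.register(0, [10, 20, 30])  # mapper 0's sizes for reducers 0..2
tracker.register(1, [5, 15, 25])   # mapper 1's sizes for reducers 0..2

# Reducer 1 fetches M entries (one per mapper) instead of the M x R table.
print(tracker.get_statuses_for_reducer(1))  # {0: 20, 1: 15}
```

Each reducer's fetch thus scales with M rather than M x R, which is the saving the ticket is after.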
[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza updated SPARK-1239:
    Summary: Don't fetch all map output statuses at each reducer during shuffles (was: Don't fetch all map outputs at each reducer during shuffles)

> Don't fetch all map output statuses at each reducer during shuffles
> -------------------------------------------------------------------
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Andrew Or
> Fix For: 1.1.0
>
> Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task.
[jira] [Updated] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-3005:
    Target Version/s: 1.1.1, 1.2.0

> Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
> Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
> I am using Spark, Mesos, and spark-cassandra-connector to do some work on a Cassandra cluster.
> During the job run, I killed the Cassandra daemon to simulate some failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the Spark driver program throws an exception and shuts down cleanly.
> But when I run the job in Mesos fine-grained mode, the Spark driver program hangs.
> The spark log is: > {code} > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 > Logging.scala (line 58) Cancelling stage 1 > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 > Logging.scala (line 79) Could not cancel tasks for stage 1 > java.lang.UnsupportedOperationException > at > org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) > at > org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >
[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117909#comment-14117909 ]

Patrick Wendell commented on SPARK-3005:

I think it might be safe to just define the semantics of cancelTask to be "best effort" and have the default implementation be empty. From what I can tell, the DAGScheduler is already resilient to a task finishing after the stage has been cleaned up (for other reasons, I think it needs to be resilient to this). So maybe we should just remove this entire ableToCancelStages logic here and make it simpler. /cc [~kayousterhout] and [~markhamstra], who worked on this code.

Hey [~xuzhongxing], what do you mean that the "tasks themselves already died and exited"? The code here is designed to cancel outstanding tasks that are still running. For instance, say I have a job with 500 tasks running. Then one task fails multiple times, so I need to fail the stage, but there are still many tasks running. I think those tasks still need to be killed; otherwise you have zombie tasks running. I think in Mesos fine-grained mode we just won't be able to support this feature... but we should make it so it doesn't hang.

> Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
> Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
> I am using Spark, Mesos, and spark-cassandra-connector to do some work on a Cassandra cluster.
> During the job run, I killed the Cassandra daemon to simulate some failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the Spark driver program throws an exception and shuts down cleanly.
> But when I run the job in Mesos fine-grained mode, the spark driver program > hangs. > The spark log is: > {code} > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 > Logging.scala (line 58) Cancelling stage 1 > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 > Logging.scala (line 79) Could not cancel tasks for stage 1 > java.lang.UnsupportedOperationException > at > org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) > at > org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handle
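The "best effort" semantics Patrick proposes for cancelTask can be sketched in a few lines: make the default implementation a no-op instead of throwing, so backends that cannot kill tasks no longer abort stage cancellation. The class names below are hypothetical stand-ins, not Spark's Scala types:

```python
class SchedulerBackend:
    """Base backend with best-effort task killing (illustrative)."""

    def kill_task(self, task_id):
        # Best effort: backends that cannot kill tasks simply do nothing,
        # instead of raising UnsupportedOperationException.
        pass


class MesosFineGrainedBackend(SchedulerBackend):
    # Inherits the no-op; previously the equivalent raised and the
    # driver hung while cancelling a failed stage.
    pass


def cancel_stage(backend, running_task_ids):
    """Cancel every outstanding task; never raises, never hangs."""
    cancelled = []
    for task_id in running_task_ids:
        backend.kill_task(task_id)  # may be a no-op, but always returns
        cancelled.append(task_id)
    return cancelled


print(cancel_stage(MesosFineGrainedBackend(), [1, 2, 3]))  # [1, 2, 3]
```

The trade-off is the one Patrick notes: on backends where killing is a no-op, tasks may run to completion as "zombies", so the DAGScheduler must tolerate late task completions for stages it has already cleaned up.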
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117905#comment-14117905 ]

Josh Rosen commented on SPARK-3333:

Still investigating. I tried this on my laptop by running the following script through spark-submit:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="test")
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(1)
parallelism = 4
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, parallelism).take(1)
{code}

With spark-1.0.2-bin-hadoop1, this ran in ~46 seconds, while branch-1.1 took ~30 seconds. Spark 1.0.2 seemed to experience one long pause that might have been due to GC, but I'll have to measure that. Both ran to completion without crashing. I'll see what happens if I bump up the number of partitions.

> Large number of partitions causes OOM
> -------------------------------------
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
> Reporter: Nicholas Chammas
> Priority: Blocker
>
> Here's a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > java.util.concurrent.Threa
[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-: --- Priority: Blocker (was: Major) > Large number of partitions causes OOM > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas >Priority: Blocker > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. 
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR > SendingConnection: Exception while reading SendingConnection to > ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014) > java.nio.channels.ClosedChannelException > at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) > at org.apache.spark.network.SendingConnection.read(Connection.scala:390) >
[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-: --- Target Version/s: 1.1.0 > Large number of partitions causes OOM > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. > Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: 
Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR > SendingConnection: Exception while reading SendingConnection to > ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014) > java.nio.channels.ClosedChannelException > at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) > at org.apache.spark.network.SendingConnection.read(Connection.scala:390) > at > org.apache.spark.network.Connec
[jira] [Resolved] (SPARK-3342) m3 instances don't get local SSDs
[ https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-3342.
    Resolution: Fixed
    Fix Version/s: 1.1.0

> m3 instances don't get local SSDs
> ---------------------------------
>
> Key: SPARK-3342
> URL: https://issues.apache.org/jira/browse/SPARK-3342
> Project: Spark
> Issue Type: Bug
> Components: EC2
> Affects Versions: 1.0.2
> Reporter: Matei Zaharia
> Assignee: Daniel Darabos
> Fix For: 1.1.0
>
> As discussed on https://github.com/apache/spark/pull/2081, these instances ignore the block device mapping on the AMI and require ephemeral drives to be added programmatically when launching them.
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117903#comment-14117903 ] Aaron Davidson commented on SPARK-: --- Anyone have a jmap on the driver during this time? Perhaps also a GC log? To be clear, a "jmap -histo:live" and a "jmap -histo" should be sufficient over a full heap dump. I wonder if something changed with the MapOutputTracker, since that state grows with O(M * R), which is extremely high in this situation. > Large number of partitions causes OOM > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. 
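Aaron's O(M * R) point above can be made concrete with a quick back-of-envelope calculation (a sketch only; the per-entry byte cost is an assumption, not a measured Spark value):

```python
# Rough estimate of MapOutputTracker state size, which grows with
# O(M * R): one map-output size entry per (map task, reduce task) pair.
def map_output_entries(num_map_tasks, num_reduce_tasks):
    return num_map_tasks * num_reduce_tasks

# With the 24000-partition repro, both M and R are 24000.
entries = map_output_entries(24000, 24000)

# Even assuming a single byte per entry (a deliberately low assumption),
# that is over half a gigabyte of tracker state held on the driver.
approx_bytes = entries * 1
print(entries, approx_bytes)
```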
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR > SendingConnection: Exception while reading SendingConnection to > ConnectionManagerId(ip-10-73-142-
[jira] [Commented] (SPARK-3342) m3 instances don't get local SSDs
[ https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117900#comment-14117900 ] Matei Zaharia commented on SPARK-3342: -- In particular see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html: "For M3 instances, you must specify instance store volumes in the block device mapping for the instance. When you launch an M3 instance, we ignore any instance store volumes specified in the block device mapping for the AMI." > m3 instances don't get local SSDs > - > > Key: SPARK-3342 > URL: https://issues.apache.org/jira/browse/SPARK-3342 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.0.2 >Reporter: Matei Zaharia >Assignee: Daniel Darabos > > As discussed on https://github.com/apache/spark/pull/2081, these instances > ignore the block device mapping on the AMI and require ephemeral drives to be > added programmatically when launching them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3342) m3 instances don't get local SSDs
Matei Zaharia created SPARK-3342: Summary: m3 instances don't get local SSDs Key: SPARK-3342 URL: https://issues.apache.org/jira/browse/SPARK-3342 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Reporter: Matei Zaharia Assignee: Daniel Darabos As discussed on https://github.com/apache/spark/pull/2081, these instances ignore the block device mapping on the AMI and require ephemeral drives to be added programmatically when launching them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
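The fix described above, adding ephemeral drives programmatically at launch, amounts to building an explicit block device mapping in the launch request. A minimal sketch of what such a mapping looks like, using plain dictionaries rather than a real EC2 client (the device letters and the two-SSD layout of m3.xlarge are illustrative assumptions, not the exact values spark-ec2 uses):

```python
def ephemeral_block_device_mapping(num_drives):
    """Build an EC2-style block device mapping that explicitly requests
    instance store (ephemeral) volumes, since m3 instances ignore the
    mapping baked into the AMI."""
    mapping = {}
    for i in range(num_drives):
        # /dev/sdb, /dev/sdc, ... are conventional device names.
        device = "/dev/sd" + chr(ord("b") + i)
        mapping[device] = {"VirtualName": "ephemeral%d" % i}
    return mapping

# m3.xlarge instances have two local SSDs.
print(ephemeral_block_device_mapping(2))
```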
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117892#comment-14117892 ] Patrick Wendell commented on SPARK-1701: I think in the current code "slices" are just a (probably redundant) synonym for "partitions". We could probably clean it up a bit, for instance in example code and the programming guide. I think in those cases it's best to just say "partitions". The number of partitions and the number of tasks are distinct concepts in general, but in the cases you listed either applies, and I think that saying "number of tasks" might be slightly more understandable for users. > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > Labels: starter > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117887#comment-14117887 ] Reynold Xin commented on SPARK-1701: "Partition" is the standard term. > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > Labels: starter > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1701: --- Labels: starter (was: ) > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > Labels: starter > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117883#comment-14117883 ] Nicholas Chammas commented on SPARK-1701: - [~pwendell] and [~rxin], I'm pinging you here to put this issue on your radar. With some guidance on what terms are favored and how a solution should be approached (e.g. can parameter names be changed?), I'm sure a contributor could pick up this task. > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117880#comment-14117880 ] Nicholas Chammas commented on SPARK-1701: - In addition to "slice" and "partition", we also have "task". Here's an example from Spark 1.0.2: * PySpark {{reduceByKey}} [API documentation|http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd.RDD-class.html#reduceByKey] says {{numPartitions}} * [Spark Programming Guide|http://spark.apache.org/docs/1.0.2/programming-guide.html#transformations] calls the same thing {{numTasks}} > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3341) The dataType of Sqrt expression should be DoubleType.
[ https://issues.apache.org/jira/browse/SPARK-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117866#comment-14117866 ] Apache Spark commented on SPARK-3341: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2233 > The dataType of Sqrt expression should be DoubleType. > - > > Key: SPARK-3341 > URL: https://issues.apache.org/jira/browse/SPARK-3341 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3341) The dataType of Sqrt expression should be DoubleType.
Takuya Ueshin created SPARK-3341: Summary: The dataType of Sqrt expression should be DoubleType. Key: SPARK-3341 URL: https://issues.apache.org/jira/browse/SPARK-3341 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
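The rationale here mirrors why most languages give square roots a floating-point result even for integer input: the square root of an exact integer is generally irrational. A plain-Python analogy (not the actual Spark SQL code):

```python
import math

# sqrt of an int is a float in Python -- mirroring why the SQL Sqrt
# expression's dataType should be DoubleType rather than the input's
# integral type.
print(type(math.sqrt(4)))  # the result type is float, even for int input
print(math.sqrt(2))        # irrational: cannot be represented as an integer
```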
[jira] [Resolved] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3135. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Reynold Xin > Avoid memory copy in TorrentBroadcast serialization > --- > > Key: SPARK-3135 > URL: https://issues.apache.org/jira/browse/SPARK-3135 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: starter > Fix For: 1.2.0 > > > TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize > broadcast object into a single giant byte array, and then separates it into > smaller chunks. We should implement a new OutputStream that writes > serialized bytes directly into chunks of byte arrays so we don't need the > extra memory copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
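The idea in the description above, writing serialized bytes directly into fixed-size chunks instead of one giant array that is then re-split, can be sketched outside of Spark as follows (chunk size and class name are illustrative; this is not the actual Spark implementation):

```python
class ChunkedOutputStream:
    """Accumulates written bytes directly into fixed-size chunks,
    avoiding the serialize-to-one-giant-array-then-split pattern
    (and its extra full copy) described in the issue."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        data = memoryview(bytes(data))
        while len(data) > 0:
            current = self.chunks[-1]
            room = self.chunk_size - len(current)
            if room == 0:
                # Current chunk is full; start a new one.
                self.chunks.append(bytearray())
                continue
            current.extend(data[:room])
            data = data[room:]

    def to_chunks(self):
        return [bytes(c) for c in self.chunks]

out = ChunkedOutputStream(chunk_size=4)
out.write(b"0123456789")
print(out.to_chunks())  # [b'0123', b'4567', b'89']
```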
[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager
[ https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117855#comment-14117855 ] Reynold Xin commented on SPARK-3168: [~tgraves] - can you take a look? I think you added the ui filter stuff. Thanks. > The ServletContextHandler of webui lacks a SessionManager > - > > Key: SPARK-3168 > URL: https://issues.apache.org/jira/browse/SPARK-3168 > Project: Spark > Issue Type: Bug > Components: Spark Core > Environment: CAS >Reporter: meiyoula > > When I use CAS to implement single sign-on for the webui, an exception occurs: > {code} > WARN [qtp1076146544-24] / > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561) > java.lang.IllegalStateException: No SessionManager > at org.eclipse.jetty.server.Request.getSession(Request.java:1269) > at org.eclipse.jetty.server.Request.getSession(Request.java:1248) > at > org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) > at > org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) > at > org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) > at > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853 ] Nicholas Chammas edited comment on SPARK- at 9/2/14 3:13 AM: - It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, sc.defaultParallelism).take(1) {code} Here are the runtimes for 3 runs on each cluster: ||1.0.2||1.1.0-rc3|| | 95s | 343s | | 89s | 336s | | 95s | 334s | So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result? was (Author: nchammas): It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, sc.defaultParallelism).take(1) {code} Here are my results for 3 runs on each cluster: ||1.0.2||1.1.0-rc3|| | 95s | 343s | | 89s | 336s | | 95s | 334s | So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result? 
> Large number of partitions causes OOM > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. > Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN 
BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftwar
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853 ] Nicholas Chammas commented on SPARK-: - It looks like the default number of reducers does indeed explain most of the performance difference here. But there is still a significant difference even after controlling this variable. I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 and one on 1.1.0-rc3. This time I ran the following PySpark code: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, sc.defaultParallelism).take(1) {code} Here are my results for 3 runs on each cluster: ||1.0.2||1.1.0-rc3|| | 95s | 343s | | 89s | 336s | | 95s | 334s | So manually setting the number of reducers to a smaller number does help a lot, but there is still a 3-4x performance slowdown. Can anyone else replicate this result? > Large number of partitions causes OOM > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. 
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org
[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes
[ https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3067: --- Target Version/s: 1.2.0 > JobProgressPage could not show Fair Scheduler Pools section sometimes > - > > Key: SPARK-3067 > URL: https://issues.apache.org/jira/browse/SPARK-3067 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai >Priority: Minor > > JobProgressPage sometimes fails to show the Fair Scheduler Pools section. > SparkContext starts the webui and then calls postEnvironmentUpdate. If > JobProgressPage is accessed between the webui starting and > postEnvironmentUpdate, the lazy val isFairScheduler is evaluated to false and > cached, so the Fair Scheduler Pools section never displays again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
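The race described above, a lazily-initialized flag being forced too early and then cached forever, is easy to reproduce in miniature. A plain-Python stand-in for the Scala `lazy val` (all names here are illustrative, not Spark's actual code):

```python
class JobProgressPageModel:
    """Mimics a Scala `lazy val`: the first read computes and caches the
    value permanently. If the page is rendered before the environment
    update arrives, is_fair_scheduler is frozen at False forever."""

    def __init__(self):
        self.scheduler_mode = None  # set later by post_environment_update
        self._is_fair = None        # the "lazy val" cache

    def post_environment_update(self, mode):
        self.scheduler_mode = mode

    @property
    def is_fair_scheduler(self):
        if self._is_fair is None:  # computed once, then cached
            self._is_fair = (self.scheduler_mode == "FAIR")
        return self._is_fair

page = JobProgressPageModel()
early = page.is_fair_scheduler        # page viewed before the update
page.post_environment_update("FAIR")  # update arrives too late
late = page.is_fair_scheduler         # still False: the cached result wins
print(early, late)  # False False
```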
[jira] [Reopened] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2773: > Shuffle:use growth rate to predict if need to spill > --- > > Key: SPARK-2773 > URL: https://issues.apache.org/jira/browse/SPARK-2773 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 0.9.0, 1.0.0 >Reporter: uncleGen >Priority: Minor > > Right now, Spark uses each thread's total usage of "shuffle" memory to > predict whether it needs to spill. I think this is not very reasonable. For > example, suppose two threads are pulling "shuffle" data and the total memory > available to buffer data is 21G. The first spill is triggered when one thread > has used 7G of memory to buffer "shuffle" data; here I assume the other has > used the same amount. Unfortunately, 7G still remains unused. So I think the > current prediction mode is too conservative and cannot maximize the usage of > "shuffle" memory. In my solution, I use the growth rate of "shuffle" memory. > The growth at each step is limited, maybe 10K * 1024 (my assumption); then > the first spill is triggered when the remaining "shuffle" memory is less than > threads * growth * 2, i.e. 2 * 10M * 2. I think this maximizes the usage of > "shuffle" memory. My solution also makes a conservative assumption, i.e. that > all threads in one executor are pulling shuffle data; however, this does not > have much effect, since the growth is limited after all. Any suggestions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2773. Resolution: Won't Fix I don't think this is needed now that SPARK-2316 is fixed. This queue is not intended to overflow during normal operation. If you still observe issues in a version of Spark that contains SPARK-2316... please report it and we'll see what is going on. > Shuffle:use growth rate to predict if need to spill > --- > > Key: SPARK-2773 > URL: https://issues.apache.org/jira/browse/SPARK-2773 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 0.9.0, 1.0.0 >Reporter: uncleGen >Priority: Minor > > Right now, Spark uses each thread's total usage of "shuffle" memory to > predict whether it needs to spill. I think this is not very reasonable. For > example, suppose two threads are pulling "shuffle" data and the total memory > available to buffer data is 21G. The first spill is triggered when one thread > has used 7G of memory to buffer "shuffle" data; here I assume the other has > used the same amount. Unfortunately, 7G still remains unused. So I think the > current prediction mode is too conservative and cannot maximize the usage of > "shuffle" memory. In my solution, I use the growth rate of "shuffle" memory. > The growth at each step is limited, maybe 10K * 1024 (my assumption); then > the first spill is triggered when the remaining "shuffle" memory is less than > threads * growth * 2, i.e. 2 * 10M * 2. I think this maximizes the usage of > "shuffle" memory. My solution also makes a conservative assumption, i.e. that > all threads in one executor are pulling shuffle data; however, this does not > have much effect, since the growth is limited after all. Any suggestions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2773) Shuffle:use growth rate to predict if need to spill
[ https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2773. Resolution: Invalid > Shuffle:use growth rate to predict if need to spill > --- > > Key: SPARK-2773 > URL: https://issues.apache.org/jira/browse/SPARK-2773 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 0.9.0, 1.0.0 >Reporter: uncleGen >Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3179) Add task OutputMetrics
[ https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117825#comment-14117825 ] Sandy Ryza commented on SPARK-3179: --- Hi Michael, Happy to help review your code or answer questions. > Add task OutputMetrics > -- > > Key: SPARK-3179 > URL: https://issues.apache.org/jira/browse/SPARK-3179 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Sandy Ryza > > Track the bytes that tasks write to HDFS or other output destinations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2078: --- Issue Type: Improvement (was: Bug) > Use ISO8601 date formats in logging > --- > > Key: SPARK-2078 > URL: https://issues.apache.org/jira/browse/SPARK-2078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Ash >Assignee: Andrew Ash > > Currently, logging has 2 digit years and doesn't include milliseconds in > logging timestamps. > Use ISO8601 date formats instead of the current custom formats. > There is some precedent here for ISO8601 format -- it's what [Hadoop > uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2078. Resolution: Won't Fix Discussion on the JIRA suggested we either keep the current format or just add milliseconds. https://github.com/apache/spark/pull/1018 > Use ISO8601 date formats in logging > --- > > Key: SPARK-2078 > URL: https://issues.apache.org/jira/browse/SPARK-2078 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Ash >Assignee: Andrew Ash > > Currently, logging has 2-digit years and doesn't include milliseconds in > logging timestamps. > Use ISO8601 date formats instead of the current custom formats. > There is some precedent here for ISO8601 format -- it's what [Hadoop > uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
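For reference, the ISO 8601 style under discussion (shown here with Python's logging module rather than log4j; in a log4j.properties the analogous change would be a `%d{ISO8601}` conversion pattern) gives 4-digit years and millisecond precision:

```python
import logging
import re

# ISO 8601 timestamps with millisecond precision, analogous to what the
# issue proposes for Spark's log4j configuration.
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03d %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
record = logging.LogRecord("demo", logging.INFO, __file__, 1, "starting", None, None)
line = formatter.format(record)
# e.g. "2014-09-01T22:51:07.123 INFO starting"
assert re.match(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3} INFO starting", line)
```

The sortable fixed-width timestamp is the practical benefit over the 2-digit-year custom format.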
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117820#comment-14117820 ] Yan commented on SPARK-3306: I am afraid I was not clear: the resources are meant to be shared across different tasks, task sets, and task waves, instead of each task making its own connection, which is very inefficient. For that purpose, I feel the Spark conf is not enough; the executor needs to be enhanced with hooks to initialize and stop the use of the long-running resources. > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan > > Currently, Spark executors only support static, read-only external resources such as side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing JDBC connections established earlier in the same Spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
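Absent such executor-level hooks, a common workaround is a module-level singleton consulted from inside the partition function. The sketch below is hypothetical (the `get_connection` helper is invented for illustration) and only shares within one worker process, not across an executor's full lifecycle as the issue requests:

```python
# Hypothetical sketch: cache one long-lived resource per worker process so
# successive tasks reuse it, instead of each task reconnecting.
_connection = None

def get_connection(make_conn):
    """Return the cached connection, creating it on first use."""
    global _connection
    if _connection is None:
        _connection = make_conn()
    return _connection

calls = []
conn1 = get_connection(lambda: calls.append(1) or "conn")
conn2 = get_connection(lambda: calls.append(1) or "conn")
assert conn1 is conn2      # reused across "tasks"
assert len(calls) == 1     # the factory ran only once
```

The gap Yan points at is the teardown half: nothing here closes the connection when the executor stops, which is exactly what an executor stop hook would provide.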
[jira] [Updated] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API
[ https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2895: - Description: This is a requirement from Hive on Spark: mapPartitionsWithContext only exists in the Spark Scala API, and we expect to be able to access it from the Spark Java API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in a mapPartitions closure need to get the taskId. was:This is a requirement from Hive on Spark, mapPartitionsWithContext only exists in Spark Scala API, we expect to access from Spark Java API. > Support mapPartitionsWithContext in Spark Java API > -- > > Key: SPARK-2895 > URL: https://issues.apache.org/jira/browse/SPARK-2895 > Project: Spark > Issue Type: New Feature > Components: Java API >Reporter: Chengxiang Li >Assignee: Chengxiang Li > Labels: hive > > This is a requirement from Hive on Spark: mapPartitionsWithContext only > exists in the Spark Scala API, and we expect to be able to access it from the Spark Java API. > For HIVE-7627 and HIVE-7843, Hive operators which are invoked in a mapPartitions > closure need to get the taskId. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1174) Adding port configuration for HttpFileServer
[ https://issues.apache.org/jira/browse/SPARK-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1174. Resolution: Duplicate > Adding port configuration for HttpFileServer > > > Key: SPARK-1174 > URL: https://issues.apache.org/jira/browse/SPARK-1174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Egor Pakhomov >Assignee: Egor Pahomov >Priority: Minor > Fix For: 0.9.0 > > > I run Spark in a big organization, where opening a port accessible to other computers on the network requires filing a DevOps ticket that takes days to process. I can't have the port for a Spark service change all the time; I need the ability to configure this port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-2638: - Assignee: Josh Rosen > Improve concurrency of fetching Map outputs > --- > > Key: SPARK-2638 > URL: https://issues.apache.org/jira/browse/SPARK-2638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 > Environment: All >Reporter: Stephen Boesch >Assignee: Josh Rosen >Priority: Minor > Labels: MapOutput, concurrency > Fix For: 1.1.0 > > Original Estimate: 0h > Remaining Estimate: 0h > > This issue was noticed while perusing the MapOutputTracker source code. > Notice that the synchronization is on the containing "fetching" collection, > which makes ALL fetches wait if any fetch is in progress. > The fix is to synchronize instead on the shuffleId (interned as a string to > ensure JVM-wide visibility). > def getServerStatuses(shuffleId: Int, reduceId: Int): > Array[(BlockManagerId, Long)] = { > val statuses = mapStatuses.get(shuffleId).orNull > if (statuses == null) { > logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching > them") > var fetchedStatuses: Array[MapStatus] = null > fetching.synchronized { // This is existing code > // shuffleId.toString.intern.synchronized { // New Code > if (fetching.contains(shuffleId)) { > // Someone else is fetching it; wait for them to be done > while (fetching.contains(shuffleId)) { > try { > fetching.wait() > } catch { > case e: InterruptedException => > } > } > This is only a small code change, but the testcases to prove (a) proper > functionality and (b) a real performance improvement are not so trivial. > For (b) it is not worthwhile to add a testcase to the codebase. Instead I > have added a git project that demonstrates the concurrency/performance > improvement using the fine-grained approach. The github project is at > https://github.com/javadba/scalatesting.git . Simply run "sbt test". 
Note: > it is unclear how/where to include this ancillary testing/verification > information that will not be included in the git PR; I am open to any > suggestions, even as far as simply removing references to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
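The fine-grained locking idea can be illustrated outside Spark. A minimal Python sketch under stated assumptions (hypothetical names; the real patch would also have to deal with lock cleanup and the semantics of Java's `String.intern`):

```python
import threading

# One lock per shuffle id instead of one lock guarding the whole
# `fetching` set, so fetches for different shuffles never block each other.
_locks = {}
_locks_guard = threading.Lock()

def lock_for(shuffle_id):
    # Short critical section: only to select/create the per-id lock.
    with _locks_guard:
        return _locks.setdefault(shuffle_id, threading.Lock())

def fetch_statuses(shuffle_id, do_fetch):
    # Fetches for other shuffle ids can proceed concurrently.
    with lock_for(shuffle_id):
        return do_fetch(shuffle_id)

assert lock_for(7) is lock_for(7)       # same shuffle -> same lock
assert lock_for(7) is not lock_for(8)   # different shuffles don't contend
assert fetch_statuses(3, lambda s: s * 2) == 6
```

The key property is that the coarse lock (`_locks_guard` here, `fetching.synchronized` in the quoted code) is held only long enough to pick the per-id lock, never for the duration of a fetch.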
[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles
[ https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117801#comment-14117801 ] Andrew Or commented on SPARK-3338: -- https://github.com/apache/spark/pull/2232 > Respect user setting of spark.submit.pyFiles > > > Key: SPARK-3338 > URL: https://issues.apache.org/jira/browse/SPARK-3338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > We currently override any setting of spark.submit.pyFiles. Even though this > is not documented, we should still respect this if the user explicitly sets > this in his/her default properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
Andrew Or created SPARK-3340: Summary: Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3319) Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.
[ https://issues.apache.org/jira/browse/SPARK-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117775#comment-14117775 ] Apache Spark commented on SPARK-3319: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2232 > Resolve spark.jars, spark.files, and spark.submit.pyFiles etc. > -- > > Key: SPARK-3319 > URL: https://issues.apache.org/jira/browse/SPARK-3319 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > We already do this for --jars, --files, and --py-files etc. For consistency, > we should do the same for the corresponding spark configs as well in case the > user sets them in spark-defaults.conf instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3308) Ability to read JSON Arrays as tables
[ https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3308: Assignee: Yin Huai > Ability to read JSON Arrays as tables > - > > Key: SPARK-3308 > URL: https://issues.apache.org/jira/browse/SPARK-3308 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Yin Huai >Priority: Critical > > Right now we can only read json where each object is on its own line. It > would be nice to be able to read top level json arrays where each element > maps to a row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3339) Support for skipping json lines that fail to parse
[ https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3339: Assignee: Yin Huai > Support for skipping json lines that fail to parse > -- > > Key: SPARK-3339 > URL: https://issues.apache.org/jira/browse/SPARK-3339 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Yin Huai > > When dealing with large datasets there is always some data that fails to > parse. It would be nice to handle this instead of throwing an exception that > requires the user to filter it out manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3339) Support for skipping json lines that fail to parse
Michael Armbrust created SPARK-3339: --- Summary: Support for skipping json lines that fail to parse Key: SPARK-3339 URL: https://issues.apache.org/jira/browse/SPARK-3339 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust When dealing with large datasets there is always some data that fails to parse. It would be nice to handle this instead of throwing an exception that requires the user to filter it out manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770 ] Patrick Wendell edited comment on SPARK-3333 at 9/1/14 10:51 PM: - [~nchammas]. I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? What happens if you run the code in Spark 1.1.0 and specify a number of reducers that matches the number chosen in 1.0.2... is the performance then the same? If this is the case we should probably document this clearly in the release notes, since changing this default could have major implications for those relying on the default behavior. was (Author: pwendell): [~nchammas]. I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? How many are there? What happens, if you run the code in Spark 1.1.0 and specify that number of reducers that matches the number chose in 1.0.2... then is the performance the same? If this is the case we should probably document this clearly in the release notes, since changing this default could have major implications for those relying on the default behavior. > Large number of partitions causes OOM > - > > Key: SPARK-3333 > URL: https://issues.apache.org/jira/browse/SPARK-3333 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. 
It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. > Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskRes
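Patrick's suggestion corresponds to `reduceByKey`'s optional `numPartitions` argument in PySpark (`reduceByKey(func, numPartitions)`), which pins the reducer count instead of inheriting the 24000 upstream partitions. The semantics of the repro itself can be modeled locally with plain-Python stand-ins (a toy single-process model, not Spark):

```python
# Toy, single-process model of keyBy(len) + reduceByKey(concat) from the repro.
def key_by(f, xs):
    return [(f(x), x) for x in xs]

def reduce_by_key(f, pairs):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

data = ["Nick", "John", "Bob"]
result = reduce_by_key(lambda x, y: x + y, key_by(len, data))
assert result == [(3, "Bob"), (4, "NickJohn")]
```

On a cluster the merge result is the same; the open question in this thread is only how many reduce partitions perform it, e.g. `reduceByKey(lambda x, y: x + y, 200)` to match the 1.0.2 default.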
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770 ] Patrick Wendell commented on SPARK-3333: [~nchammas]. I think the default number of reducers could be the culprit. If you look at the actual job run in Spark 1.0.2, how many reducers are there in the shuffle? What happens if you run the code in Spark 1.1.0 and specify a number of reducers that matches the number chosen in 1.0.2... is the performance then the same? If this is the case we should probably document this clearly in the release notes, since changing this default could have major implications for those relying on the default behavior. > Large number of partitions causes OOM > - > > Key: SPARK-3333 > URL: https://issues.apache.org/jira/browse/SPARK-3333 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. 
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
[jira] [Created] (SPARK-3338) Respect user setting of spark.submit.pyFiles
Andrew Or created SPARK-3338: Summary: Respect user setting of spark.submit.pyFiles Key: SPARK-3338 URL: https://issues.apache.org/jira/browse/SPARK-3338 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or We currently override any setting of spark.submit.pyFiles. Even though this is not documented, we should still respect this if the user explicitly sets this in his/her default properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3338) Respect user setting of spark.submit.pyFiles
[ https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3338: - Component/s: (was: PySpark) > Respect user setting of spark.submit.pyFiles > > > Key: SPARK-3338 > URL: https://issues.apache.org/jira/browse/SPARK-3338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > We currently override any setting of spark.submit.pyFiles. Even though this > is not documented, we should still respect this if the user explicitly sets > this in his/her default properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
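The precedence SPARK-3338 asks for can be sketched as follows. This is a hypothetical helper, not the actual SparkSubmit code (the real logic lives in SparkSubmit's argument handling):

```python
def resolve_py_files(cli_value, defaults):
    """An explicit --py-files flag wins; otherwise respect the user's
    spark.submit.pyFiles from the defaults file instead of overriding it."""
    if cli_value is not None:
        return cli_value
    return defaults.get("spark.submit.pyFiles", "")

# Explicit default-file setting survives when no flag is given:
assert resolve_py_files(None, {"spark.submit.pyFiles": "deps.zip"}) == "deps.zip"
# The command-line flag still takes precedence:
assert resolve_py_files("cli.zip", {"spark.submit.pyFiles": "deps.zip"}) == "cli.zip"
```

The bug, as described, is that the current code behaves as if the first branch always ran, clobbering the defaults-file value even when no flag was passed.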
[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods
[ https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2312: -- Assignee: Josh Rosen > Spark Actors do not handle unknown messages in their receive methods > > > Key: SPARK-2312 > URL: https://issues.apache.org/jira/browse/SPARK-2312 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Kam Kasravi >Assignee: Josh Rosen >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Per akka documentation - an actor should provide a pattern match for all > messages including _ otherwise akka.actor.UnhandledMessage will be > propagated. > Noted actors: > MapOutputTrackerMasterActor, ClientActor, Master, Worker... > Should minimally do a > logWarning(s"Received unexpected actor system event: $_") so message info is > logged in correct actor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods
[ https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2312: -- Assignee: Isaias Barroso (was: Josh Rosen) > Spark Actors do not handle unknown messages in their receive methods > > > Key: SPARK-2312 > URL: https://issues.apache.org/jira/browse/SPARK-2312 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Kam Kasravi >Assignee: Isaias Barroso >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Per akka documentation - an actor should provide a pattern match for all > messages including _ otherwise akka.actor.UnhandledMessage will be > propagated. > Noted actors: > MapOutputTrackerMasterActor, ClientActor, Master, Worker... > Should minimally do a > logWarning(s"Received unexpected actor system event: $_") so message info is > logged in correct actor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622 ] Matei Zaharia commented on SPARK-3098: -- It's true that the ordering of values after a shuffle is nondeterministic, so that for example on failure you might get a different order of keys in a reduceByKey or distinct or operations like that. However, I think that's the way it should be (and we can document it). RDDs are deterministic when viewed as a multiset, but not when viewed as an ordered collection, unless you do sortByKey. Operations like zipWithIndex are meant to be more of a convenience to get unique IDs or act on something with a known ordering (such as a text file where you want to know the line numbers). But the freedom to control fetch ordering is quite important for performance, especially if you want to have a push-based shuffle in the future. If we wanted to get the same result every time, we could design reduce tasks to tell the master the order they fetched stuff in after the first time they ran, but even then, notice that it might limit the kind of shuffle mechanisms we allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd rather not make that guarantee now. 
> In some cases, operation zipWithIndex get a wrong results > -- > > Key: SPARK-3098 > URL: https://issues.apache.org/jira/browse/SPARK-3098 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Guoqiang Li >Priority: Critical > > The reproduce code: > {code} > val c = sc.parallelize(1 to 7899).flatMap { i => > (1 to 1).toSeq.map(p => i * 6000 + p) > }.distinct().zipWithIndex() > c.join(c).filter(t => t._2._1 != t._2._2).take(3) > {code} > => > {code} > Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), > (36579712,(13,14))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
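Matei's distinction — deterministic as a multiset but not as an ordered collection — can be shown with a pure-Python stand-in for `zipWithIndex` (illustrative only, not the Spark API):

```python
def zip_with_index(xs):
    # Stand-in for RDD.zipWithIndex: each index follows the current ordering.
    return [(x, i) for i, x in enumerate(xs)]

run1 = [30, 10, 20]   # same multiset after a shuffle...
run2 = [10, 30, 20]   # ...but a different fetch order on a re-run

assert sorted(run1) == sorted(run2)                  # equal as multisets
assert zip_with_index(run1) != zip_with_index(run2)  # indices disagree
# Sorting first pins the ordering, so the indices become reproducible:
assert zip_with_index(sorted(run1)) == zip_with_index(sorted(run2))
```

This mirrors the advice in the comment: apply `sortByKey` (or another explicit ordering) before `zipWithIndex` if stable indices across recomputations matter.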
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117581#comment-14117581 ] Timothy Chen commented on SPARK-2022: - This should be resolved now, [~pwend...@gmail.com] please help close this. > Spark 1.0.0 is failing if mesos.coarse set to true > -- > > Key: SPARK-2022 > URL: https://issues.apache.org/jira/browse/SPARK-2022 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Marek Wiewiorka >Assignee: Tim Chen >Priority: Critical > > more stderr > --- > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 > I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave > 201405220917-134217738-5050-27119-0 > Exception in thread "main" java.lang.NumberFormatException: For input string: > "sparkseq003.cloudapp.net" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:492) > at java.lang.Integer.parseInt(Integer.java:527) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > more stdout > --- > Registered executor on sparkseq003.cloudapp.net > Starting task 5 > Forked command at 61202 > sh -c '"/home/mesos/spark-1.0.0/bin/spark-class" > org.apache.spark.executor.CoarseGrainedExecutorBackend > -Dspark.mesos.coarse=true > akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG > rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net > 4' > Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex gets wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558 ] Ye Xianjin commented on SPARK-3098: --- Hi [~srowen] and [~gq], I think what [~matei] means is that because the ordering of elements after distinct() is not guaranteed, the result of zipWithIndex is not deterministic. If you recompute the RDD containing the distinct transformation, you are not guaranteed to get the same result. That explains the behavior here. But as [~srowen] said, it's surprising to see different results from the same RDD. [~matei], what do you think about this behavior? > In some cases, operation zipWithIndex gets wrong results > -- > > Key: SPARK-3098 > URL: https://issues.apache.org/jira/browse/SPARK-3098 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Guoqiang Li >Priority: Critical > > Code to reproduce: > {code} > val c = sc.parallelize(1 to 7899).flatMap { i => > (1 to 1).toSeq.map(p => i * 6000 + p) > }.distinct().zipWithIndex() > c.join(c).filter(t => t._2._1 != t._2._2).take(3) > {code} > => > {code} > Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), > (36579712,(13,14))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2594) Add CACHE TABLE AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117464#comment-14117464 ] Ravindra Pesala commented on SPARK-2594: Thank you, Michael. The following are the tasks I am planning to do to support this feature: 1. Change the SqlParser to support the CACHE TABLE syntax, and also change HiveQl to support the syntax for Hive. 2. Add a new strategy 'AddCacheTable' to 'SparkStrategies'. 3. In the new strategy 'AddCacheTable', register the tableName with the plan and cache the result under tableName. Please review and comment. > Add CACHE TABLE AS SELECT ... > > > Key: SPARK-2594 > URL: https://issues.apache.org/jira/browse/SPARK-2594 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
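The intended semantics of the proposed feature can be sketched, independent of Spark, as a registry that maps a table name to the materialized result of its query. All names below are hypothetical illustrations of step 3 of the plan, not the actual SqlParser/SparkStrategies code:

```python
class Catalog:
    """Toy catalog sketching CACHE TABLE <name> AS SELECT ... semantics."""
    def __init__(self):
        self.cached = {}  # table name -> materialized rows

    def cache_table_as_select(self, name, query):
        # Register the name with the plan's result and cache it under that name.
        self.cached[name] = list(query())
        return self.cached[name]

    def table(self, name):
        # Later queries against `name` read the cached rows directly.
        return self.cached[name]

catalog = Catalog()
rows = [("a", 1), ("b", 2), ("c", 3)]
catalog.cache_table_as_select("small", lambda: (r for r in rows if r[1] > 1))
print(catalog.table("small"))  # [('b', 2), ('c', 3)]
```

The real implementation would register the logical plan with Spark SQL's catalog and rely on the existing cache-table machinery rather than eagerly materializing rows as this toy does.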
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453 ] Guoqiang Li edited comment on SPARK-1405 at 9/1/14 2:25 PM: Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is OK to review. was (Author: gq): Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: features > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms in > MLlib, which use optimization algorithms such as gradient descent, LDA uses > expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation component (imported from > Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453 ] Guoqiang Li commented on SPARK-1405: Hi all, [PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: features > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms in > MLlib, which use optimization algorithms such as gradient descent, LDA uses > expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation component (imported from > Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3191) Add explanation of support building spark with maven in http proxy environment
[ https://issues.apache.org/jira/browse/SPARK-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117400#comment-14117400 ] zhengbing li commented on SPARK-3191: - Another way to solve this issue is to set an environment variable: export MAVEN_OPTS="-Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true" > Add explanation of support building spark with maven in http proxy environment > -- > > Key: SPARK-3191 > URL: https://issues.apache.org/jira/browse/SPARK-3191 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.0.2 > Environment: linux suse 11 > maven version: apache-maven-3.0.5 > spark version: 1.0.1 > proxy setting of maven is: > {code} > <proxy> > <id>lzb</id> > <active>true</active> > <protocol>http</protocol> > <username>user</username> > <password>password</password> > <host>proxy.company.com</host> > <port>8080</port> > <nonProxyHosts>*.company.com</nonProxyHosts> > </proxy> > {code} >Reporter: zhengbing li >Priority: Trivial > Labels: build, maven > Fix For: 1.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > When I run "mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean > package" in an http proxy environment, the build cannot finish. The error is as > follows: > [INFO] Spark Project YARN Stable API . SUCCESS [34.217s] > [INFO] Spark Project Assembly FAILURE [43.133s] > [INFO] Spark Project External Twitter SKIPPED > [INFO] Spark Project External Kafka .. SKIPPED > [INFO] Spark Project External Flume .. SKIPPED > [INFO] Spark Project External ZeroMQ . SKIPPED > [INFO] Spark Project External MQTT ... SKIPPED > [INFO] Spark Project Examples SKIPPED > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 27:57.309s > [INFO] Finished at: Sat Aug 23 09:43:21 CST 2014 > [INFO] Final Memory: 51M/1080M > [INFO] > > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project > spark-assembly_2.10: Execution default of goal > org.apache.maven.plugins:maven-shade-plugin:2.2:shade failed: Plugin > org.apache.maven.plugins:maven-shade-plugin:2.2 or one of its dependencies > could not be resolved: Could not find artifact > com.google.code.findbugs:jsr305:jar:1.3.9 -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :spark-assembly_2.10 > With the following command, the build succeeds: > mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 > -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true > -DskipTests clean package > The error message is not very clear; I spent a long time solving this issue. > To help others who build Spark behind an http proxy, I > highly recommend adding this to the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117381#comment-14117381 ] Apache Spark commented on SPARK-2096: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/2230 > Correctly parse dot notations for accessing an array of structs > --- > > Key: SPARK-2096 > URL: https://issues.apache.org/jira/browse/SPARK-2096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yin Huai >Priority: Minor > Labels: starter > > For example, "arrayOfStruct" is an array of structs and every element of this > array has a field called "field1". "arrayOfStruct[0].field1" means to access > the value of "field1" for the first element of "arrayOfStruct", but the SQL > parser (in sql-core) treats "field1" as an alias. Also, > "arrayOfStruct.field1" means to access all values of "field1" in this array > of structs and the returns those values as an array. But, the SQL parser > cannot resolve it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117377#comment-14117377 ] Apache Spark commented on SPARK-3337: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/2229 > Paranoid quoting in shell to allow install dirs with spaces within. > --- > > Key: SPARK-3337 > URL: https://issues.apache.org/jira/browse/SPARK-3337 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3337: --- Component/s: Build > Paranoid quoting in shell to allow install dirs with spaces within. > --- > > Key: SPARK-3337 > URL: https://issues.apache.org/jira/browse/SPARK-3337 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3337: --- Component/s: Spark Core > Paranoid quoting in shell to allow install dirs with spaces within. > --- > > Key: SPARK-3337 > URL: https://issues.apache.org/jira/browse/SPARK-3337 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
Prashant Sharma created SPARK-3337: -- Summary: Paranoid quoting in shell to allow install dirs with spaces within. Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117361#comment-14117361 ] Apache Spark commented on SPARK-3328: - User 'prudhvije' has created a pull request for this issue: https://github.com/apache/spark/pull/2228 > ./make-distribution.sh --with-tachyon build is broken > - > > Key: SPARK-3328 > URL: https://issues.apache.org/jira/browse/SPARK-3328 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 >Reporter: Elijah Epifanov > > cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such > file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
[ https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117350#comment-14117350 ] Apache Spark commented on SPARK-3178: - User 'bbejeck' has created a pull request for this issue: https://github.com/apache/spark/pull/2227 > setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the > worker memory limit to zero > > > Key: SPARK-3178 > URL: https://issues.apache.org/jira/browse/SPARK-3178 > Project: Spark > Issue Type: Bug > Environment: osx >Reporter: Jon Haddad >Assignee: Bill Bejeck > Labels: starter > > This should either default to m or just completely fail. Starting a worker > with zero memory isn't very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
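A hedged sketch of the suggested fix, in plain Python rather than Spark's actual memory-string parsing code: treat a bare number as megabytes (or fail fast) instead of silently producing a zero-memory worker.

```python
def memory_string_to_mb(s):
    """Parse strings like '512m' or '2g' into megabytes.

    Illustrative only: unlike the buggy behavior described above, a
    bare number is treated as megabytes instead of yielding a
    zero-memory worker; failing fast is the other reasonable choice.
    """
    s = s.strip().lower()
    if s.endswith("g"):
        return int(s[:-1]) * 1024
    if s.endswith("m"):
        return int(s[:-1])
    if s.isdigit():
        return int(s)  # default to MB rather than zero
    raise ValueError(f"cannot parse memory string: {s!r}")

print(memory_string_to_mb("2g"))   # 2048
print(memory_string_to_mb("512"))  # 512
```

Either behavior (default unit or hard failure) matches the ticket's point that starting a worker with zero memory helps no one.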
[jira] [Commented] (SPARK-3292) Shuffle tasks run incessantly even though there are no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117308#comment-14117308 ] guowei commented on SPARK-3292: --- I've created PR 2192 to fix it, but I don't know how to link it here. > Shuffle tasks run incessantly even though there are no inputs > --- > > Key: SPARK-3292 > URL: https://issues.apache.org/jira/browse/SPARK-3292 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.2 >Reporter: guowei > > Operations such as repartition, groupBy, join and cogroup run their shuffle > tasks even when there is no input. > For example, if I save the shuffle outputs as a Hadoop file, many empty > files are generated even though there is no input. > This is too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends
[ https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117299#comment-14117299 ] Prashant Sharma commented on SPARK-3322: SPARK-3171 seems related. I am not sure I fully understand the intent; a bit more information on what this issue is about would be helpful. However, if you think this is essentially the same issue as SPARK-3171, feel free to close it. > Log a ConnectionManager error when the application ends > --- > > Key: SPARK-3322 > URL: https://issues.apache.org/jira/browse/SPARK-3322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: wangfei > > Although it does not affect the result, it always logs an error from > ConnectionManager when the application ends. > Sometimes it only logs "ConnectionManagerId(vm2,40992) not found" and sometimes > it also logs "CancelledKeyException". > The log output is as follows: > 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to > ConnectionManagerId(vm2,40992) not found > 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? > sun.nio.ch.SelectionKeyImpl@457245f9 > java.nio.channels.CancelledKeyException > at > org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) > at > org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3071) Increase default driver memory
[ https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117241#comment-14117241 ] Prashant Sharma commented on SPARK-3071: This makes sense. Looking at it from a different angle: when the default memory is set too low, the user is forced to discover that it is configurable somewhere. If you increase the default, most things will work without OOMs, and since the change takes effect cluster-wide, users may never need to look at the setting at all. (This is like being too critical, but just another angle.) > Increase default driver memory > -- > > Key: SPARK-3071 > URL: https://issues.apache.org/jira/browse/SPARK-3071 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng > > The current default is 512M, which is usually too small because the user also > uses the driver to do some computation. In local mode, the executor memory setting is > ignored while only driver memory is used, which provides more incentive to > increase the default driver memory. > I suggest > 1. 2GB in local mode and warn users if executor memory is set to a bigger value > 2. same as worker memory on an EC2 standalone server -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3275) Socket receiver can not recover when the socket server restarted
[ https://issues.apache.org/jira/browse/SPARK-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117216#comment-14117216 ] Prashant Sharma commented on SPARK-3275: AFAIK, from a conversation with TD: Spark Streaming has a listener interface which lets you listen for events such as a receiver dying, so you can then manually restart or start another receiver somewhere. > Socket receiver can not recover when the socket server restarted > - > > Key: SPARK-3275 > URL: https://issues.apache.org/jira/browse/SPARK-3275 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Jack Hu > Labels: failover > > To reproduce this issue: > 1. create an application with a socket dstream > 2. start the socket server and start the application > 3. restart the socket server > 4. the socket dstream fails to reconnect (it closes the connection > after a successful connect) > The main issue is that the status in SocketReceiver and ReceiverSupervisor > is incorrect after the reconnect: > In SocketReceiver::receive() the while loop is never entered after a > reconnect since isStopped returns true: > val iterator = bytesToObjects(socket.getInputStream()) > while(!isStopped && iterator.hasNext) { > store(iterator.next) > } > logInfo("Stopped receiving") > restart("Retrying connecting to " + host + ":" + port) > That is caused by the status flag "receiverState" in ReceiverSupervisor, which > is set to Stopped when the connection is lost but is only reset after the call > to the receiver's start method: > def startReceiver(): Unit = synchronized { > try { > logInfo("Starting receiver") > receiver.onStart() > logInfo("Called receiver onStart") > onReceiverStart() > receiverState = Started > } catch { > case t: Throwable => > stop("Error starting receiver " + streamId, Some(t)) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
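The state race described in the report can be modeled without Spark. In this toy Python model (hypothetical names, not the real ReceiverSupervisor), the receive loop is skipped whenever the supervisor's flag still says Stopped from the previous failure; resetting the flag before calling onStart, shown as one possible fix, lets the loop run again:

```python
class Supervisor:
    """Stand-in for ReceiverSupervisor's receiverState flag."""
    def __init__(self):
        self.state = "Initialized"

    def is_stopped(self):
        return self.state == "Stopped"

class SocketReceiver:
    """Stand-in for SocketReceiver.receive()'s guarded loop."""
    def __init__(self, supervisor):
        self.sup = supervisor

    def on_start(self):
        # Mirrors the while(!isStopped && ...) loop: the body never runs
        # if the supervisor still says Stopped from the previous failure.
        consumed = 0
        while not self.sup.is_stopped():
            consumed += 1  # pretend we read one record
            break          # then end this toy run
        return consumed

sup = Supervisor()
sup.state = "Stopped"       # connection was lost on the previous run
receiver = SocketReceiver(sup)

# Buggy restart order: state is reset only AFTER on_start() returns.
print(receiver.on_start())  # 0 -- the receive loop is skipped
sup.state = "Started"       # too late

# Possible fix: reset the flag BEFORE calling on_start().
print(receiver.on_start())  # 1 -- the loop runs again
```

The actual fix in Spark would need to reorder or guard the state transitions inside ReceiverSupervisor itself; this model only demonstrates why the loop exits immediately.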
[jira] [Commented] (SPARK-3282) It should support multiple receivers at one socketInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117208#comment-14117208 ] Prashant Sharma commented on SPARK-3282: You can create as many receivers as you like by creating multiple DStreams and doing a union on them. There are examples illustrating this, like: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RawNetworkGrep.scala. There you can specify how many DStreams to create, and the same number of receivers is created. > It should support multiple receivers at one socketInputDStream > --- > > Key: SPARK-3282 > URL: https://issues.apache.org/jira/browse/SPARK-3282 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.2 >Reporter: shenhong > > At present, a socketInputDStream supports at most one receiver, which becomes > a bottleneck when a large input stream appears. > It should support multiple receivers at one socketInputDStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
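The union-of-receivers idea can be mimicked in plain Python (illustrative only; in Spark you would union several socketTextStream DStreams as in the linked example): several producer threads play the role of receivers feeding one logical stream.

```python
import queue
import threading

def receiver(source, out):
    # Each "receiver" drains its own input source into the shared stream.
    for record in source:
        out.put(record)

# Three independent input sources, one per receiver.
sources = [range(0, 5), range(5, 10), range(10, 15)]
unioned = queue.Queue()  # the "unioned DStream"

threads = [threading.Thread(target=receiver, args=(s, unioned)) for s in sources]
for t in threads:
    t.start()
for t in threads:
    t.join()

records = []
while not unioned.empty():
    records.append(unioned.get())

# Arrival order across receivers is nondeterministic, but nothing is lost.
print(sorted(records) == list(range(15)))  # True
```

This mirrors the workaround's trade-off: parallelism comes from running more receivers, while ordering across them is not guaranteed.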
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kay feng updated SPARK-3335: Description: Running pyspark on a spark cluster with a standalone master, Spark SQL cannot use broadcast variables in a UDF, although we can use broadcast variables in Spark in Scala. For example, bar={"a":"aa", "b":"bb", "c":"abc"} foo=sc.broadcast(bar) sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else ''). q= sqlContext.sql('SELECT MYUDF(c) FROM foobar') out = q.collect() Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/root/spark/python/pyspark/worker.py", line 75, in main command = pickleSer._read_with_length(infile) File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length return self.loads(obj) File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id raise Exception("Broadcast variable '%s' not loaded!" 
% bid) Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver 
stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaFo
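A common workaround for the failure above, hedged because the underlying pyspark serialization gap is exactly what this ticket tracks, is to close over the plain Python value instead of the Broadcast wrapper so the UDF's closure carries the data itself. Sketched with a stand-in Broadcast class, no Spark required (pyspark ships closures with cloudpickle; here we pickle the captured objects directly to illustrate):

```python
import pickle

class Broadcast:
    """Stand-in for pyspark's Broadcast: it refuses to be pickled, mimicking
    a worker that never registered the variable ("not loaded")."""
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        raise Exception("Broadcast variable not loaded!")

bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = Broadcast(bar)

# Fails: shipping a closure over `foo` would drag the Broadcast through pickle.
try:
    pickle.dumps(foo)
except Exception as e:
    print("serialization failed:", e)

# Workaround: capture the materialized value instead of the wrapper.
plain = foo.value
good_udf = lambda x: plain[x] if x else ''
print(good_udf("c"))  # abc
```

The captured dict round-trips through pickle fine, so a UDF closing over it can be shipped to workers; the cost is losing the broadcast's one-copy-per-executor efficiency.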
[jira] [Created] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
kay feng created SPARK-3336:
---
Summary: [Spark SQL] In pyspark, cannot group by field on UDF
Key: SPARK-3336
URL: https://issues.apache.org/jira/browse/SPARK-3336
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng

Running pyspark on a Spark cluster with a standalone master, we cannot group by the result of a UDF, although grouping by a UDF works in Scala. For example:

q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)')
out = q.collect()

This raises the following exception:

Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.immutable.List.foreach(List.scala:318)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(Projection.scala:52)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.(Aggregate.scala:176)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace: at org.apache.
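A common workaround for this class of failure is to alias the UDF result in a subquery so that the outer query groups by a plain column instead of re-evaluating the UDF, e.g. `SELECT COUNT(*), k FROM (SELECT MYUDF(foo) AS k FROM bar) t GROUP BY k` (`k` and `t` are names introduced here, and this rewrite is untested against the affected version). The underlying pattern — compute the grouping key once per row, then group by the precomputed key — can be sketched without Spark; `my_udf` and `group_count_by_udf` below are illustrative stand-ins, not Spark API:

```python
from collections import Counter

# Hypothetical stand-in for the report's MYUDF: any deterministic key function.
def my_udf(x):
    return x.upper()[:1]  # e.g. group rows by their uppercased first letter

# Pure-Python equivalent of:
#   SELECT COUNT(*), k FROM (SELECT MYUDF(foo) AS k FROM bar) t GROUP BY k
# The key is computed once per row, then rows are counted per key.
def group_count_by_udf(rows):
    return dict(Counter(my_udf(r) for r in rows))

counts = group_count_by_udf(["apple", "avocado", "banana"])
print(counts)  # {'A': 2, 'B': 1}
```

The same precompute-then-group shape is what the subquery rewrite asks the SQL engine to do, which is why it sidesteps binding the `pythonUDF#...` attribute in the aggregate.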
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kay feng updated SPARK-3335:
Description:

Running pyspark on a Spark cluster with a standalone master, broadcast variables cannot be used in a UDF. For example:

bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
q = sqlContext.sql('SELECT MYUDF(c) FROM foobar')
out = q.collect()

This raises the following exception:

Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length
return self.loads(obj)
File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,))
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at sca
[jira] [Created] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
kay feng created SPARK-3335:
---
Summary: [Spark SQL] In pyspark, cannot use broadcast variables in UDF
Key: SPARK-3335
URL: https://issues.apache.org/jira/browse/SPARK-3335
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng

Running pyspark on a Spark cluster with a standalone master, broadcast variables cannot be used in a UDF. For example:

bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')

Got the following exception:

Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/root/spark/python/pyspark/serializers.py", line 150, in _read_with_length
return self.loads(obj)
File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,))
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.sc
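The worker-side failure occurs while unpickling the UDF, whose closure captured the Broadcast handle (`foo`) rather than its value. One workaround pattern, assuming the report's dict lookup, is to bind the plain value into the function so that serializing the UDF never touches the Broadcast object at all. The Spark-free sketch below illustrates this (PySpark actually serializes UDFs with cloudpickle; stdlib `pickle` is used here only to show that the bound function round-trips cleanly):

```python
import pickle
from functools import partial

bar = {"a": "aa", "b": "bb", "c": "abc"}

# A module-level function plus functools.partial is picklable with the
# stdlib, and the bound argument is the plain dict (in PySpark, foo.value),
# not the Broadcast handle that fails to deserialize on the worker.
def lookup_udf(lookup, x):
    return lookup[x] if x else ''

my_udf = partial(lookup_udf, bar)            # sketch of: partial(lookup_udf, foo.value)
clone = pickle.loads(pickle.dumps(my_udf))   # the serialization round trip succeeds
print(clone("c"))  # abc
```

In PySpark the analogous registration would be something like `sqlContext.registerFunction("MYUDF", partial(lookup_udf, foo.value))` — an assumption about a workaround, not a confirmed fix for the underlying broadcast-loading bug.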
[jira] [Comment Edited] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156 ] Prashant Sharma edited comment on SPARK-3292 at 9/1/14 8:12 AM: This issue reminds me of a pull request that tried to do nothing when there is nothing to save for streaming jobs. I will try to recall the PR number; in the meantime, it would be great if you could link it. was (Author: prashant_): This issue reminds me of a pull request that was trying to not do anything if there is nothing to save. I will try to recollect the PR number. But in the meantime, would be great if you could link it. > Shuffle Tasks run incessantly even though there's no inputs > --- > > Key: SPARK-3292 > URL: https://issues.apache.org/jira/browse/SPARK-3292 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.2 >Reporter: guowei > > For operations such as repartition, groupBy, join, and cogroup, for example: > if I want to save the shuffle outputs as a Hadoop file, many empty files are > generated even though there is no input. This is too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156 ] Prashant Sharma commented on SPARK-3292: This issue reminds me of a pull request that tried to do nothing when there is nothing to save. I will try to recall the PR number; in the meantime, it would be great if you could link it. > Shuffle Tasks run incessantly even though there's no inputs > --- > > Key: SPARK-3292 > URL: https://issues.apache.org/jira/browse/SPARK-3292 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.2 >Reporter: guowei > > For operations such as repartition, groupBy, join, and cogroup, for example: > if I want to save the shuffle outputs as a Hadoop file, many empty files are > generated even though there is no input. This is too expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
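Whatever the recollected pull request did, the guard the report asks for is simple to sketch: check whether there are any records before writing the output at all, so empty batches produce no empty part files. `save_if_nonempty` below is an illustrative, Spark-free stand-in (in Spark one might test the RDD for emptiness, e.g. with `rdd.take(1)`, before calling a save action — an assumption, not the actual PR's mechanism):

```python
import os
import tempfile

# Illustrative guard: write an output file only when there are records,
# avoiding the many empty files the report complains about.
def save_if_nonempty(records, path):
    records = list(records)
    if not records:
        return False          # nothing to save: skip the write entirely
    with open(path, "w") as f:
        f.write("\n".join(map(str, records)))
    return True

with tempfile.TemporaryDirectory() as d:
    wrote = save_if_nonempty([], os.path.join(d, "part-00000"))
    print(wrote, os.listdir(d))  # False [] -- no empty file was created
```

The cost being avoided is not just the empty files themselves but the downstream jobs (and NameNode metadata, on HDFS) that every zero-length part file drags along.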
[jira] [Commented] (SPARK-3007) Add "Dynamic Partition" support to Spark Sql hive
[ https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117152#comment-14117152 ] Apache Spark commented on SPARK-3007: - User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/2226 > Add "Dynamic Partition" support to Spark Sql hive > --- > > Key: SPARK-3007 > URL: https://issues.apache.org/jira/browse/SPARK-3007 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: baishuo > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3306: --- Target Version/s: (was: 1.3.0) > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan > > Currently, Spark executors only support static and read-only external > resources such as side files and jar files. With emerging disparate data sources, > there is a need to support more versatile external resources, such as > connections to data sources, to facilitate efficient data access to those > sources. For one, the JDBCRDD, with some modifications, could benefit from > this feature by reusing JDBC connections previously established in the same > Spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117138#comment-14117138 ] Prashant Sharma commented on SPARK-3306: I am not 100% sure why you are referring to Spark executors. My feeling is that there is no need to touch the executors to support something like this: you can probably add a new Spark conf option, and by default all Spark conf options are propagated to the executors. I will let you close this issue if you are convinced. > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan > > Currently, Spark executors only support static and read-only external > resources such as side files and jar files. With emerging disparate data sources, > there is a need to support more versatile external resources, such as > connections to data sources, to facilitate efficient data access to those > sources. For one, the JDBCRDD, with some modifications, could benefit from > this feature by reusing JDBC connections previously established in the same > Spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends
[ https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117117#comment-14117117 ] Prashant Sharma commented on SPARK-3322: I remember seeing this discussion somewhere else as well, probably on a pull request. It seems the PR is not linked to this issue somehow; it would be great if you could link it. > Log a ConnectionManager error when the application ends > --- > > Key: SPARK-3322 > URL: https://issues.apache.org/jira/browse/SPARK-3322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: wangfei > > Although it does not affect the result, an error from ConnectionManager is > always logged. Sometimes only "ConnectionManagerId(vm2,40992) not found" is > logged, and sometimes a "CancelledKeyException" is also logged. > The log output is as follows: > 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to > ConnectionManagerId(vm2,40992) not found > 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? > sun.nio.ch.SelectionKeyImpl@457245f9 > java.nio.channels.CancelledKeyException > at > org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) > at > org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-3328: --- Summary: ./make-distribution.sh --with-tachyon build is broken (was: --with-tachyon build is broken) > ./make-distribution.sh --with-tachyon build is broken > - > > Key: SPARK-3328 > URL: https://issues.apache.org/jira/browse/SPARK-3328 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 >Reporter: Elijah Epifanov > > cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such > file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org