[jira] [Resolved] (SPARK-2636) Expose job ID in JobWaiter API

2014-09-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2636.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Expose job ID in JobWaiter API
> --
>
> Key: SPARK-2636
> URL: https://issues.apache.org/jira/browse/SPARK-2636
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>Assignee: Chengxiang Li
>  Labels: hive
> Fix For: 1.2.0
>
>
> In Hive on Spark, we want to track Spark job status through the Spark API. The 
> basic idea is as follows:
> # Create a Hive-specific Spark listener and register it with the Spark listener 
> bus.
> # The Hive-specific listener derives job status from Spark listener events.
> # The Hive driver tracks job status through the Hive-specific listener. 
> The current problem is that the Hive driver needs a job identifier to track a 
> specific job's status through the listener, but there is no Spark API that 
> returns a job identifier (such as a job ID) when a Spark job is submitted.
> I think any other project that tries to track job status through the Spark API 
> would suffer from this as well.
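
As a rough illustration of the listener side of this design (a minimal sketch, not code from this ticket or its patch; it assumes the standard SparkListener job events, which carry a job ID, and that the submitter can learn that ID once this feature exists):

{code}
import org.apache.spark.scheduler._
import scala.collection.concurrent.TrieMap

// Sketch only: a Hive-side listener that keeps per-job status keyed by job ID.
// The missing piece this ticket asks for is a way to obtain the job ID at
// submission time, so the driver knows which entry to poll.
class HiveJobStateListener extends SparkListener {
  private val jobStates = TrieMap.empty[Int, String]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobStates(jobStart.jobId) = "RUNNING"

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    jobStates(jobEnd.jobId) = jobEnd.jobResult.toString

  def status(jobId: Int): Option[String] = jobStates.get(jobId)
}
{code}

The listener would be registered with something like {{sc.addSparkListener(new HiveJobStateListener)}}.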






[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-09-01 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-
Attachment: (was: Spark listener enhancement for Hive on Spark job 
monitor and statistic.docx)

> enhance spark listener API to gather more spark job information
> ---
>
> Key: SPARK-2633
> URL: https://issues.apache.org/jira/browse/SPARK-2633
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>Priority: Critical
>  Labels: hive
> Attachments: Spark listener enhancement for Hive on Spark job monitor 
> and statistic.docx
>
>
> Based on the job status monitoring and statistics collection requirements of 
> Hive on Spark, enhance the Spark listener API to gather more Spark job 
> information.
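
For illustration, a hedged sketch of the kind of aggregation a richer listener API would make easy; the event and metric field names used here (taskMetrics, executorRunTime) are assumptions about the 1.x listener API rather than anything taken from the attached design doc:

{code}
import org.apache.spark.scheduler._
import scala.collection.mutable

// Sketch only: accumulate executor run time per stage from task-end events and
// report it when the stage completes.
class JobStatsListener extends SparkListener {
  private val runTimeByStage = mutable.Map.empty[Int, Long].withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    Option(taskEnd.taskMetrics).foreach { metrics =>
      runTimeByStage(taskEnd.stageId) += metrics.executorRunTime
    }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageId = stageCompleted.stageInfo.stageId
    println(s"Stage $stageId: total executor run time = ${runTimeByStage(stageId)} ms")
  }
}
{code}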






[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-09-01 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-
Attachment: Spark listener enhancement for Hive on Spark job monitor and 
statistic.docx

> enhance spark listener API to gather more spark job information
> ---
>
> Key: SPARK-2633
> URL: https://issues.apache.org/jira/browse/SPARK-2633
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>Priority: Critical
>  Labels: hive
> Attachments: Spark listener enhancement for Hive on Spark job monitor 
> and statistic.docx
>
>
> Based on the job status monitoring and statistics collection requirements of 
> Hive on Spark, enhance the Spark listener API to gather more Spark job 
> information.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117918#comment-14117918
 ] 

Patrick Wendell commented on SPARK-1701:


I think that's a straw man. A closer review once there is a patch
might reveal some corner cases.



> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117917#comment-14117917
 ] 

Nicholas Chammas commented on SPARK-1701:
-

OK, so it sounds like the action is:
* Replace all occurrences of "slice" with "partition"
* Leave occurrences of "task" as-is

[~darabos] - Feel free to run with this.

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117914#comment-14117914
 ] 

Patrick Wendell commented on SPARK-1239:


I think [~andrewor] has this in his backlog but it's not actively being worked 
on. [~bcwalrus], do you or [~sandyr] want to take a crack at it?

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-01 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1239:
--
Summary: Don't fetch all map output statuses at each reducer during 
shuffles  (was: Don't fetch all map outputs at each reducer during shuffles)

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Updated] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3005:
---
Target Version/s: 1.1.1, 1.2.0

> Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
> MesosSchedulerBackend.killTask()
> ---
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
>Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
>
> I am using Spark, Mesos, spark-cassandra-connector to do some work on a 
> cassandra cluster.
> During the job running, I killed the Cassandra daemon to simulate some 
> failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the spark driver program 
> throws an exception and shutdown cleanly.
> But when I run the job in Mesos fine-grained mode, the spark driver program 
> hangs.
> The spark log is: 
> {code}
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
> Logging.scala (line 58) Cancelling stage 1
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
> Logging.scala (line 79) Could not cancel tasks for stage 1
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 

[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117909#comment-14117909
 ] 

Patrick Wendell commented on SPARK-3005:


I think it might be safe to just define the semantics of cancelTask to be "best 
effort" and have the default implementation be empty. From what I can tell, the 
DAGScheduler is already resilient to a task finishing after the stage has been 
cleaned up (for other reasons, I think it needs to be resilient to this). So maybe 
we should just remove this entire ableToCancelStages logic and make it 
simpler. /cc [~kayousterhout] and [~markhamstra], who worked on this code.

Hey [~xuzhongxing], what do you mean by "tasks themselves already died and 
exited"? The code here is designed to cancel outstanding tasks that are still 
running. For instance, suppose I have a job with 500 tasks running. Then one task 
fails multiple times, so I need to fail the stage, but there are 
still many tasks running. I think those tasks still need to be killed; 
otherwise you have zombie tasks running. I think in Mesos fine-grained mode we 
just won't be able to support this feature... but we should make sure it 
doesn't hang.
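
For illustration, a minimal sketch of the "best effort" default being described here (the real SchedulerBackend trait has more members, and this is not the eventual patch):

{code}
// Sketch only: instead of throwing UnsupportedOperationException, a backend that
// cannot kill individual tasks simply returns, so cancelTasks() cannot blow up or
// hang the driver in Mesos fine-grained mode.
trait BestEffortKill {
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
    // default: best effort -- do nothing if the backend has no way to kill tasks
  }
}
{code}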


> Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
> MesosSchedulerBackend.killTask()
> ---
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
>Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
>
> I am using Spark, Mesos, spark-cassandra-connector to do some work on a 
> cassandra cluster.
> During the job running, I killed the Cassandra daemon to simulate some 
> failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the spark driver program 
> throws an exception and shutdown cleanly.
> But when I run the job in Mesos fine-grained mode, the spark driver program 
> hangs.
> The spark log is: 
> {code}
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
> Logging.scala (line 58) Cancelling stage 1
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
> Logging.scala (line 79) Could not cancel tasks for stage 1
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handle

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117905#comment-14117905
 ] 

Josh Rosen commented on SPARK-3333:
---

Still investigating.  I tried this on my laptop by running the following script 
through spark-submit:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="test")
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(1)
parallelism = 4
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, parallelism).take(1)
{code}

With spark-1.0.2-bin-hadoop1, this ran in ~46 seconds, while branch-1.1 took 
~30 seconds.  Spark 1.0.2 seemed to experience one long pause that might have 
been due to GC, but I'll have to measure that.  Both ran to completion without 
crashing.  I'll see what happens if I bump up the number of partitions.

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>Priority: Blocker
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.Threa

[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3333:
---
Priority: Blocker  (was: Major)

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>Priority: Blocker
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
> SendingConnection: Exception while reading SendingConnection to 
> ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
> at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
>   

[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3333:
---
Target Version/s: 1.1.0

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
> SendingConnection: Exception while reading SendingConnection to 
> ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
> at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
> at 
> org.apache.spark.network.Connec

[jira] [Resolved] (SPARK-3342) m3 instances don't get local SSDs

2014-09-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3342.
--
   Resolution: Fixed
Fix Version/s: 1.1.0

> m3 instances don't get local SSDs
> -
>
> Key: SPARK-3342
> URL: https://issues.apache.org/jira/browse/SPARK-3342
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2
>Reporter: Matei Zaharia
>Assignee: Daniel Darabos
> Fix For: 1.1.0
>
>
> As discussed on https://github.com/apache/spark/pull/2081, these instances 
> ignore the block device mapping on the AMI and require ephemeral drives to be 
> added programmatically when launching them.






[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117903#comment-14117903
 ] 

Aaron Davidson commented on SPARK-3333:
---

Anyone have a jmap on the driver during this time? Perhaps also a GC log? To be 
clear, a "jmap -histo:live" and a "jmap -histo" should be sufficient over a 
full heap dump. I wonder if something changed with the MapOutputTracker, since 
that state grows with O(M * R), which is extremely high in this situation.
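
As a rough back-of-envelope for this repro (assuming each map status carries on the order of one byte of compressed-size information per reduce partition, which is an assumption about the 1.x MapStatus layout): with repartition(24000) and roughly as many reducers by default, M x R ~ 24,000 x 24,000 ~ 5.8 x 10^8 entries, i.e. hundreds of megabytes of status data on the driver before any serialization or per-reducer copies, which would be consistent with a driver-side OOM.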

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
> SendingConnection: Exception while reading SendingConnection to 
> ConnectionManagerId(ip-10-73-142-

[jira] [Commented] (SPARK-3342) m3 instances don't get local SSDs

2014-09-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117900#comment-14117900
 ] 

Matei Zaharia commented on SPARK-3342:
--

In particular see 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html:

"For M3 instances, you must specify instance store volumes in the block device 
mapping for the instance. When you launch an M3 instance, we ignore any 
instance store volumes specified in the block device mapping for the AMI."

> m3 instances don't get local SSDs
> -
>
> Key: SPARK-3342
> URL: https://issues.apache.org/jira/browse/SPARK-3342
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2
>Reporter: Matei Zaharia
>Assignee: Daniel Darabos
>
> As discussed on https://github.com/apache/spark/pull/2081, these instances 
> ignore the block device mapping on the AMI and require ephemeral drives to be 
> added programmatically when launching them.






[jira] [Created] (SPARK-3342) m3 instances don't get local SSDs

2014-09-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3342:


 Summary: m3 instances don't get local SSDs
 Key: SPARK-3342
 URL: https://issues.apache.org/jira/browse/SPARK-3342
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.0.2
Reporter: Matei Zaharia
Assignee: Daniel Darabos


As discussed on https://github.com/apache/spark/pull/2081, these instances 
ignore the block device mapping on the AMI and require ephemeral drives to be 
added programmatically when launching them.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117892#comment-14117892
 ] 

Patrick Wendell commented on SPARK-1701:


I think in the current code "slices" are just a (probably redundant) synonym 
for "partitions". We could probably clean it up a bit, for instance in example 
code and the programming guide. I think in those cases it's best to just say 
"partitions".

The number of partitions and the number of tasks are distinct concepts in 
general, but in the cases you listed either applies, and I think that saying 
"number of tasks" might be slightly more understandable for users.

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117887#comment-14117887
 ] 

Reynold Xin commented on SPARK-1701:


"Partition" is the standard term. 

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Updated] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1701:
---
Labels: starter  (was: )

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117883#comment-14117883
 ] 

Nicholas Chammas commented on SPARK-1701:
-

[~pwendell] and [~rxin], I'm pinging you here to put this issue on your radar. 

With some guidance on what terms are favored and how a solution should be 
approached (e.g. can parameter names be changed?), I'm sure a contributor 
could pick up this task.

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117880#comment-14117880
 ] 

Nicholas Chammas commented on SPARK-1701:
-

In addition to "slice" and "partition", we also have "task".

Here's an example from Spark 1.0.2:
* PySpark {{reduceByKey}} [API 
documentation|http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd.RDD-class.html#reduceByKey]
 says {{numPartitions}}
* [Spark Programming 
Guide|http://spark.apache.org/docs/1.0.2/programming-guide.html#transformations]
 calls the same thing {{numTasks}}

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-3341) The dataType of Sqrt expression should be DoubleType.

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117866#comment-14117866
 ] 

Apache Spark commented on SPARK-3341:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2233

> The dataType of Sqrt expression should be DoubleType.
> -
>
> Key: SPARK-3341
> URL: https://issues.apache.org/jira/browse/SPARK-3341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>







[jira] [Created] (SPARK-3341) The dataType of Sqrt expression should be DoubleType.

2014-09-01 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-3341:


 Summary: The dataType of Sqrt expression should be DoubleType.
 Key: SPARK-3341
 URL: https://issues.apache.org/jira/browse/SPARK-3341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin









[jira] [Resolved] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization

2014-09-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3135.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Reynold Xin

> Avoid memory copy in TorrentBroadcast serialization
> ---
>
> Key: SPARK-3135
> URL: https://issues.apache.org/jira/browse/SPARK-3135
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: starter
> Fix For: 1.2.0
>
>
> TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize 
> the broadcast object into a single giant byte array, and then separates it into 
> smaller chunks.  We should implement a new OutputStream that writes 
> serialized bytes directly into chunks of byte arrays so we don't need the 
> extra memory copy.
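
A minimal sketch of the idea (not the code that shipped in 1.2.0): an OutputStream that appends serialized bytes straight into fixed-size chunks, so the serializer never produces one giant contiguous array that then has to be copied.

{code}
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

// Sketch only: write bytes directly into chunk-sized arrays.
class ChunkedByteArrayOutputStream(chunkSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var position = chunkSize  // forces allocation of the first chunk on first write

  override def write(b: Int): Unit = {
    if (position == chunkSize) {
      chunks += new Array[Byte](chunkSize)
      position = 0
    }
    chunks.last(position) = b.toByte
    position += 1
  }

  /** The filled chunks; the last one is trimmed to the bytes actually written. */
  def toChunks: Array[Array[Byte]] = {
    if (chunks.isEmpty) Array.empty
    else chunks.init.toArray :+ chunks.last.take(position)
  }
}
{code}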






[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager

2014-09-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117855#comment-14117855
 ] 

Reynold Xin commented on SPARK-3168:


[~tgraves] - can you take a look? I think you added the ui filter stuff. Thanks.

> The ServletContextHandler of webui lacks a SessionManager
> -
>
> Key: SPARK-3168
> URL: https://issues.apache.org/jira/browse/SPARK-3168
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: CAS
>Reporter: meiyoula
>
> When I use CAS to implement single sign-on for the web UI, an exception occurs:
> {code}
> WARN  [qtp1076146544-24] / 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
> java.lang.IllegalStateException: No SessionManager
> at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
> at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
> at 
> org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
> at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
> at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
> at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:744)
> {code}
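
For reference, a hedged sketch of the kind of change being asked for, using the Jetty 8-era servlet API (this is not the actual Spark UI code): give the ServletContextHandler a SessionHandler so filters that call request.getSession(), like the CAS filters in the trace above, no longer fail with "No SessionManager".

{code}
import org.eclipse.jetty.server.session.SessionHandler
import org.eclipse.jetty.servlet.ServletContextHandler

// Sketch only: build a context handler that has a session manager attached.
def sessionAwareContext(contextPath: String): ServletContextHandler = {
  val context = new ServletContextHandler()
  context.setContextPath(contextPath)
  context.setSessionHandler(new SessionHandler())  // provides the missing SessionManager
  context
}
{code}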






[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853
 ] 

Nicholas Chammas edited comment on SPARK-3333 at 9/2/14 3:13 AM:
-

It looks like the default number of reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling for this variable. 

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are the runtimes for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?


was (Author: nchammas):
It looks like the default number of  reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling this variable. 

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftwar

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853
 ] 

Nicholas Chammas commented on SPARK-3333:
-

It looks like the default number of reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling for this variable.

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org

[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3067:
---
Target Version/s: 1.2.0

> JobProgressPage could not show Fair Scheduler Pools section sometimes
> -
>
> Key: SPARK-3067
> URL: https://issues.apache.org/jira/browse/SPARK-3067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> JobProgressPage sometimes fails to show the Fair Scheduler Pools section.
> SparkContext starts the web UI and then calls postEnvironmentUpdate. If 
> JobProgressPage is accessed between the web UI starting and 
> postEnvironmentUpdate, the lazy val isFairScheduler evaluates to false, and 
> the Fair Scheduler Pools section never displays after that.
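
For illustration, a minimal Scala sketch of the lazy-val pitfall being described (the names are made up; this is not Spark's actual code):

{code}
object UiState {
  @volatile var schedulingMode: String = "FIFO"  // updated later, e.g. by postEnvironmentUpdate
  // A lazy val is computed on first access and then cached for good.
  lazy val isFairScheduler: Boolean = schedulingMode == "FAIR"
}

// If anything reads UiState.isFairScheduler before schedulingMode is set to
// "FAIR", the cached value stays false for the lifetime of the object.
{code}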



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-2773:


> Shuffle:use growth rate to predict if need to spill
> ---
>
> Key: SPARK-2773
> URL: https://issues.apache.org/jira/browse/SPARK-2773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 0.9.0, 1.0.0
>Reporter: uncleGen
>Priority: Minor
>
> Right now, Spark uses the total "shuffle" memory usage of each thread to 
> predict whether it needs to spill. I think that is not very reasonable. For 
> example, suppose there are two threads pulling "shuffle" data and the total 
> memory available to buffer data is 21G. The first spill is triggered when one 
> thread has used 7G to buffer "shuffle" data (here I assume the other one has 
> used the same amount). Unfortunately, there are still 7G left to use. So I 
> think the current prediction mode is too conservative and cannot maximize the 
> usage of "shuffle" memory. In my solution, I use the growth rate of "shuffle" 
> memory instead. The growth at each step is limited, maybe 10K * 1024 (my 
> assumption), so the first spill is triggered when the remaining "shuffle" 
> memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think this can 
> maximize the usage of "shuffle" memory. My solution also makes a conservative 
> assumption, i.e. that all threads are pulling shuffle data in one executor. 
> However, it does not have much effect; the growth is limited after all. Any 
> suggestions?
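
A minimal Scala sketch of the growth-rate check being proposed (names and numbers are illustrative, not Spark's actual implementation):

{code}
// Spill once the remaining pool could be exhausted within roughly two more
// growth steps across all threads, i.e. remaining < threads * growth * 2.
def shouldSpill(remainingShuffleMemory: Long,
                activeThreads: Int,
                observedGrowthPerStep: Long): Boolean =
  remainingShuffleMemory < activeThreads.toLong * observedGrowthPerStep * 2

// e.g. shouldSpill(30L << 20, 2, 10L << 20) => true, since 30M < 2 * 10M * 2
{code}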



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2773.

Resolution: Won't Fix

I don't think this is needed now that SPARK-2316 is fixed. This queue is not 
intended to overflow during normal operation. If you still observe issues in a 
version of Spark that contains SPARK-2316... please report it and we'll see 
what is going on.

> Shuffle:use growth rate to predict if need to spill
> ---
>
> Key: SPARK-2773
> URL: https://issues.apache.org/jira/browse/SPARK-2773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 0.9.0, 1.0.0
>Reporter: uncleGen
>Priority: Minor
>
> Right now, Spark uses the total usage of "shuffle" memory of each thread to 
> predict if need to spill. I think it is not very reasonable. For example, 
> there are two threads pulling "shuffle" data. The total memory used to buffer 
> data is 21G. The first time to trigger spilling it when one thread has used 
> 7G memory to buffer "shuffle" data, here I assume another one has used the 
> same size. Unfortunately, I still have remaining 7G to use. So, I think 
> current prediction mode is too conservative, and can not maximize the usage 
> of "shuffle" memory. In my solution, I use the growth rate of "shuffle" 
> memory. Again, the growth of each time is limited, maybe 10K * 1024(my 
> assumption), then the first time to trigger spilling is when the remaining 
> "shuffle" memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
> it can maximize the usage of "shuffle" memory. In my solution, there is also 
> a conservative assumption, i.e. all of threads is pulling shuffle data in one 
> executor. However it dose not have much effect, the grow is limited after 
> all. Any suggestion?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2773) Shuffle:use growth rate to predict if need to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2773.

Resolution: Invalid

> Shuffle:use growth rate to predict if need to spill
> ---
>
> Key: SPARK-2773
> URL: https://issues.apache.org/jira/browse/SPARK-2773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 0.9.0, 1.0.0
>Reporter: uncleGen
>Priority: Minor
>
> Right now, Spark uses the total usage of "shuffle" memory of each thread to 
> predict if need to spill. I think it is not very reasonable. For example, 
> there are two threads pulling "shuffle" data. The total memory used to buffer 
> data is 21G. The first time to trigger spilling it when one thread has used 
> 7G memory to buffer "shuffle" data, here I assume another one has used the 
> same size. Unfortunately, I still have remaining 7G to use. So, I think 
> current prediction mode is too conservative, and can not maximize the usage 
> of "shuffle" memory. In my solution, I use the growth rate of "shuffle" 
> memory. Again, the growth of each time is limited, maybe 10K * 1024(my 
> assumption), then the first time to trigger spilling is when the remaining 
> "shuffle" memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
> it can maximize the usage of "shuffle" memory. In my solution, there is also 
> a conservative assumption, i.e. all of threads is pulling shuffle data in one 
> executor. However it dose not have much effect, the grow is limited after 
> all. Any suggestion?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3179) Add task OutputMetrics

2014-09-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117825#comment-14117825
 ] 

Sandy Ryza commented on SPARK-3179:
---

Hi Michael,

Happy to help review your code or answer questions.

> Add task OutputMetrics
> --
>
> Key: SPARK-3179
> URL: https://issues.apache.org/jira/browse/SPARK-3179
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> Track the bytes that tasks write to HDFS or other output destinations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2078) Use ISO8601 date formats in logging

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2078:
---
Issue Type: Improvement  (was: Bug)

> Use ISO8601 date formats in logging
> ---
>
> Key: SPARK-2078
> URL: https://issues.apache.org/jira/browse/SPARK-2078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> Currently, logging has 2 digit years and doesn't include milliseconds in 
> logging timestamps.
> Use ISO8601 date formats instead of the current custom formats.
> There is some precedent here for ISO8601 format -- it's what [Hadoop 
> uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties]
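
For reference, one way the requested change could look in a log4j 1.x properties file (the appender names here are illustrative and may not match Spark's bundled conf/log4j.properties):

{code}
# %d{ISO8601} prints e.g. 2014-09-01 21:53:40,123 (4-digit year plus milliseconds)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c{1}: %m%n
{code}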



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2078) Use ISO8601 date formats in logging

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2078.

Resolution: Won't Fix

Discussion on the JIRA suggested either keeping the current format or just 
adding milliseconds.

https://github.com/apache/spark/pull/1018

> Use ISO8601 date formats in logging
> ---
>
> Key: SPARK-2078
> URL: https://issues.apache.org/jira/browse/SPARK-2078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> Currently, logging has 2 digit years and doesn't include milliseconds in 
> logging timestamps.
> Use ISO8601 date formats instead of the current custom formats.
> There is some precedent here for ISO8601 format -- it's what [Hadoop 
> uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2014-09-01 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117820#comment-14117820
 ] 

Yan commented on SPARK-3306:


I am afraid I was not clear: the resources are meant to be shared across 
different tasks, task sets and task waves, instead of letting each task make 
the connection by itself, which is very inefficient. For that purpose, I feel 
the Spark conf is not enough; the executor needs to be enhanced with hooks to 
initialize and stop the use of these long-running resources.
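
A minimal Scala sketch of the kind of sharing being asked for. Everything below is illustrative (the Connection class and URL are placeholders); the point of this issue is precisely that Spark has no executor-level lifecycle hook for such resources today:

{code}
class Connection(url: String) {                // stand-in for a real data-source connection
  def lookup(key: String): String = key        // placeholder work
  def close(): Unit = ()
}

object ExecutorShared {
  // A lazy singleton is initialized once per executor JVM and then reused by
  // every task scheduled on that executor, instead of one connection per task.
  lazy val conn = new Connection("jdbc:example://host/db")
}

// Usage inside a job, assuming rdd is an existing RDD[String]:
// rdd.mapPartitions(iter => iter.map(k => ExecutorShared.conn.lookup(k)))
{code}

What this sketch cannot do, and what the issue asks for, is a clean way to initialize and close such a resource when the executor starts and stops.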

> Addition of external resource dependency in executors
> -
>
> Key: SPARK-3306
> URL: https://issues.apache.org/jira/browse/SPARK-3306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Yan
>
> Currently, Spark executors only support static and read-only external 
> resources of side files and jar files. With emerging disparate data sources, 
> there is a need to support more versatile external resources, such as 
> connections to data sources, to facilitate efficient data accesses to the 
> sources. For one, the JDBCRDD, with some modifications,  could benefit from 
> this feature by reusing established JDBC connections from the same Spark 
> context before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API

2014-09-01 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2895:
-
Description: 
This is a requirement from Hive on Spark: mapPartitionsWithContext only exists 
in the Spark Scala API, and we would like to access it from the Spark Java API. 
For HIVE-7627 and HIVE-7843, Hive operators which are invoked in the 
mapPartitions closure need to get the taskId.

  was:This is a requirement from Hive on Spark, mapPartitionsWithContext only 
exists in Spark Scala API, we expect to access from Spark Java API.


> Support mapPartitionsWithContext in Spark Java API
> --
>
> Key: SPARK-2895
> URL: https://issues.apache.org/jira/browse/SPARK-2895
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>Assignee: Chengxiang Li
>  Labels: hive
>
> This is a requirement from Hive on Spark: mapPartitionsWithContext only 
> exists in the Spark Scala API, and we would like to access it from the Spark 
> Java API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in 
> the mapPartitions closure need to get the taskId.
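
For context, a minimal sketch of how the existing Scala-only method is used (rdd is assumed to be an existing RDD[String]; this is the API the issue asks to expose to Java):

{code}
import org.apache.spark.TaskContext

val tagged = rdd.mapPartitionsWithContext { (ctx: TaskContext, iter: Iterator[String]) =>
  // The TaskContext carries the identifiers Hive operators need, e.g. stage and partition ids.
  iter.map(v => s"stage=${ctx.stageId} partition=${ctx.partitionId} value=$v")
}
{code}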



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1174) Adding port configuration for HttpFileServer

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1174.

Resolution: Duplicate

> Adding port configuration for HttpFileServer
> 
>
> Key: SPARK-1174
> URL: https://issues.apache.org/jira/browse/SPARK-1174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Egor Pakhomov
>Assignee: Egor Pahomov
>Priority: Minor
> Fix For: 0.9.0
>
>
> I run Spark in a big organization where, to open a port accessible to other 
> computers on the network, I need to create a ticket with DevOps, and that 
> takes days. I can't have the port for some Spark service changing all the 
> time. I need the ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2638) Improve concurrency of fetching Map outputs

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-2638:
-

Assignee: Josh Rosen

> Improve concurrency of fetching Map outputs
> ---
>
> Key: SPARK-2638
> URL: https://issues.apache.org/jira/browse/SPARK-2638
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Stephen Boesch
>Assignee: Josh Rosen
>Priority: Minor
>  Labels: MapOutput, concurrency
> Fix For: 1.1.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> This issue was noticed while perusing the MapOutputTracker source code. 
> Notice that the synchronization is on the containing "fetching" collection, 
> which makes ALL fetches wait if any fetch is occurring.
> The fix is to synchronize instead on the shuffleId (interned as a string to 
> ensure JVM-wide visibility).
>   def getServerStatuses(shuffleId: Int, reduceId: Int): 
> Array[(BlockManagerId, Long)] = {
> val statuses = mapStatuses.get(shuffleId).orNull
> if (statuses == null) {
>   logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching 
> them")
>   var fetchedStatuses: Array[MapStatus] = null
>   fetching.synchronized {   // This is existing code
>  //  shuffleId.toString.intern.synchronized {  // New Code
> if (fetching.contains(shuffleId)) {
>   // Someone else is fetching it; wait for them to be done
>   while (fetching.contains(shuffleId)) {
> try {
>   fetching.wait()
> } catch {
>   case e: InterruptedException =>
> }
>   }
> This is only a small code change, but the test cases to prove (a) proper 
> functionality and (b) proper performance improvement are not so trivial. 
> For (b) it is not worthwhile to add a test case to the codebase. Instead I 
> have added a git project that demonstrates the concurrency/performance 
> improvement using the fine-grained approach. The GitHub project is at
> https://github.com/javadba/scalatesting.git . Simply run "sbt test". Note: 
> it is unclear how/where to include this ancillary testing/verification 
> information that will not be included in the git PR; I am open to any 
> suggestions - even as far as simply removing references to it.
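
A minimal Scala sketch of the finer-grained locking being proposed (illustrative only, not the actual patch; fetchStatuses is a hypothetical placeholder):

{code}
// Interning the id string yields one shared lock object per shuffle across the
// JVM, so fetches of the same shuffle still serialize while fetches of
// different shuffles can proceed concurrently.
def withShuffleLock[T](shuffleId: Int)(body: => T): T =
  shuffleId.toString.intern.synchronized { body }

// e.g. withShuffleLock(shuffleId) { fetchStatuses(shuffleId) }
{code}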



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117801#comment-14117801
 ] 

Andrew Or commented on SPARK-3338:
--

https://github.com/apache/spark/pull/2232

> Respect user setting of spark.submit.pyFiles
> 
>
> Key: SPARK-3338
> URL: https://issues.apache.org/jira/browse/SPARK-3338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We currently override any setting of spark.submit.pyFiles. Even though this 
> is not documented, we should still respect this if the user explicitly sets 
> this in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES

2014-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3340:


 Summary: Deprecate ADD_JARS and ADD_FILES
 Key: SPARK-3340
 URL: https://issues.apache.org/jira/browse/SPARK-3340
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or


These were introduced before Spark submit even existed. Now that there are many 
better ways of setting jars and python files through Spark submit, we should 
deprecate these environment variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3319) Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117775#comment-14117775
 ] 

Apache Spark commented on SPARK-3319:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2232

> Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.
> --
>
> Key: SPARK-3319
> URL: https://issues.apache.org/jira/browse/SPARK-3319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We already do this for --jars, --files, and --py-files etc. For consistency, 
> we should do the same for the corresponding spark configs as well in case the 
> user sets them in spark-defaults.conf instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3308) Ability to read JSON Arrays as tables

2014-09-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3308:

Assignee: Yin Huai

> Ability to read JSON Arrays as tables
> -
>
> Key: SPARK-3308
> URL: https://issues.apache.org/jira/browse/SPARK-3308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>Priority: Critical
>
> Right now we can only read json where each object is on its own line.  It 
> would be nice to be able to read top level json arrays where each element 
> maps to a row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3339) Support for skipping json lines that fail to parse

2014-09-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3339:

Assignee: Yin Huai

> Support for skipping json lines that fail to parse
> --
>
> Key: SPARK-3339
> URL: https://issues.apache.org/jira/browse/SPARK-3339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>
> When dealing with large datasets there is always some data that fails to 
> parse. It would be nice to handle this instead of throwing an exception that 
> requires the user to filter it out manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3339) Support for skipping json lines that fail to parse

2014-09-01 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3339:
---

 Summary: Support for skipping json lines that fail to parse
 Key: SPARK-3339
 URL: https://issues.apache.org/jira/browse/SPARK-3339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


When dealing with large datasets there is always some data that fails to parse. 
It would be nice to handle this instead of throwing an exception that requires 
the user to filter it out manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770
 ] 

Patrick Wendell edited comment on SPARK-3333 at 9/1/14 10:51 PM:
-

[~nchammas] I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? What happens if you run the code in Spark 1.1.0 and specify a 
number of reducers that matches the number chosen in 1.0.2... is the 
performance the same then?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.


was (Author: pwendell):
[~nchammas]. I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? How many are there? What happens, if you run the code in Spark 
1.1.0 and specify that number of reducers that matches the number chose in 
1.0.2... then is the performance the same?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskRes

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770
 ] 

Patrick Wendell commented on SPARK-3333:


[~nchammas] I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? What happens if you run the code in Spark 1.1.0 and specify a 
number of reducers that matches the number chosen in 1.0.2... is the 
performance the same then?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.
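
For concreteness, a Scala shell sketch of the comparison being asked for (sc is the shell's SparkContext; 16 is an arbitrary, explicit reducer count):

{code}
val pairs   = sc.parallelize(Seq("Nick", "John", "Bob")).repartition(24000).keyBy(_.length)
val reduced = pairs.reduceByKey(_ + _, 16)   // pin the number of reducers explicitly
println(reduced.partitions.length)           // confirm how many reducers the shuffle used
{code}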

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java

[jira] [Created] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3338:


 Summary: Respect user setting of spark.submit.pyFiles
 Key: SPARK-3338
 URL: https://issues.apache.org/jira/browse/SPARK-3338
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or


We currently override any setting of spark.submit.pyFiles. Even though this is 
not documented, we should still respect this if the user explicitly sets this 
in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3338:
-
Component/s: (was: PySpark)

> Respect user setting of spark.submit.pyFiles
> 
>
> Key: SPARK-3338
> URL: https://issues.apache.org/jira/browse/SPARK-3338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We currently override any setting of spark.submit.pyFiles. Even though this 
> is not documented, we should still respect this if the user explicitly sets 
> this in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2312:
--
Assignee: Josh Rosen

> Spark Actors do not handle unknown messages in their receive methods
> 
>
> Key: SPARK-2312
> URL: https://issues.apache.org/jira/browse/SPARK-2312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kam Kasravi
>Assignee: Josh Rosen
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Per the Akka documentation, an actor should provide a pattern match for all 
> messages, including a catch-all _; otherwise an akka.actor.UnhandledMessage 
> will be propagated. 
> Noted actors:
> MapOutputTrackerMasterActor, ClientActor, Master, Worker...
> They should minimally do a 
> logWarning(s"Received unexpected actor system event: $_") so the message info 
> is logged in the correct actor.
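
A minimal Scala sketch of the catch-all clause being asked for (the actor and the logWarning helper are illustrative, not one of the Spark actors listed above):

{code}
import akka.actor.Actor

class ExampleActor extends Actor {
  private def logWarning(msg: String): Unit = println(s"WARN: $msg")

  def receive = {
    case "known-message" =>  // handle the messages this actor expects
    case other           => logWarning(s"Received unexpected actor system event: $other")
  }
}
{code}

Binding the unmatched message to a name (other) lets the warning include the actual message rather than the literal $_.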



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2312:
--
Assignee: Isaias Barroso  (was: Josh Rosen)

> Spark Actors do not handle unknown messages in their receive methods
> 
>
> Key: SPARK-2312
> URL: https://issues.apache.org/jira/browse/SPARK-2312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kam Kasravi
>Assignee: Isaias Barroso
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Per the Akka documentation, an actor should provide a pattern match for all 
> messages, including a catch-all _; otherwise an akka.actor.UnhandledMessage 
> will be propagated. 
> Noted actors:
> MapOutputTrackerMasterActor, ClientActor, Master, Worker...
> They should minimally do a 
> logWarning(s"Received unexpected actor system event: $_") so the message info 
> is logged in the correct actor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622
 ] 

Matei Zaharia commented on SPARK-3098:
--

It's true that the ordering of values after a shuffle is nondeterministic, so 
that for example on failure you might get a different order of keys in a 
reduceByKey or distinct or operations like that. However, I think that's the 
way it should be (and we can document it). RDDs are deterministic when viewed 
as a multiset, but not when viewed as an ordered collection, unless you do 
sortByKey. Operations like zipWithIndex are meant to be more of a convenience 
to get unique IDs or act on something with a known ordering (such as a text 
file where you want to know the line numbers). But the freedom to control fetch 
ordering is quite important for performance, especially if you want to have a 
push-based shuffle in the future.

If we wanted to get the same result every time, we could design reduce tasks to 
tell the master the order they fetched stuff in after the first time they ran, 
but even then, notice that it might limit the kind of shuffle mechanisms we 
allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd 
rather not make that guarantee now.
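
A minimal Scala sketch of the workaround this implies: if reproducible indices are required, impose an ordering before zipWithIndex (sc is the shell's SparkContext):

{code}
val data = sc.parallelize(Seq(3, 1, 2, 2, 5))
// sortBy pins a total order, so recomputation assigns the same index to the same value.
val ids  = data.distinct().sortBy(identity).zipWithIndex()
{code}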

>  In some cases, operation zipWithIndex get a wrong results
> --
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Guoqiang Li
>Priority: Critical
>
> The reproduce code:
> {code}
>  val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 1).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex() 
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true

2014-09-01 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117581#comment-14117581
 ] 

Timothy Chen commented on SPARK-2022:
-

This should be resolved now, [~pwend...@gmail.com] please help close this.

> Spark 1.0.0 is failing if mesos.coarse set to true
> --
>
> Key: SPARK-2022
> URL: https://issues.apache.org/jira/browse/SPARK-2022
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Marek Wiewiorka
>Assignee: Tim Chen
>Priority: Critical
>
> more stderr
> ---
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2
> I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 
> 201405220917-134217738-5050-27119-0
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "sparkseq003.cloudapp.net"
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:492)
> at java.lang.Integer.parseInt(Integer.java:527)
> at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> more stdout
> ---
> Registered executor on sparkseq003.cloudapp.net
> Starting task 5
> Forked command at 61202
> sh -c '"/home/mesos/spark-1.0.0/bin/spark-class" 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> -Dspark.mesos.coarse=true 
> akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG
> rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 
> 4'
> Command exited with status 1 (pid: 61202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558
 ] 

Ye Xianjin commented on SPARK-3098:
---

hi, [~srowen] and [~gq], I think what [~matei] wants to say is that because the 
ordering of elements in distinct() is not guaranteed, the result of 
zipWithIndex is not deterministic. If you recompute the RDD with distinct 
transformation, you are not guaranteed to get the same result. That explains 
the behavior here.

But as [~srowen] said, it's surprising to see different results from the same 
RDD. [~matei], what do you think about this behavior?

>  In some cases, operation zipWithIndex get a wrong results
> --
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Guoqiang Li
>Priority: Critical
>
> The reproduce code:
> {code}
>  val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 1).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex() 
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2594) Add CACHE TABLE AS SELECT ...

2014-09-01 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117464#comment-14117464
 ] 

Ravindra Pesala commented on SPARK-2594:


Thank you Michael. Following are the tasks which I am planning to do to support 
this feature.
1. Change the SqlParser to support the CACHE TABLE ... AS SELECT syntax, and 
also change HiveQl to support the syntax for Hive.
2. Add a new strategy 'AddCacheTable' to 'SparkStrategies'.
3. In the new strategy 'AddCacheTable', register the tableName with the plan 
and cache the plan under that tableName.

Please review it and comment.
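
For illustration, a hypothetical usage sketch of the proposed syntax once implemented (sqlContext is assumed to be an existing SQLContext; the table and column names are made up):

{code}
sqlContext.sql("CACHE TABLE adults AS SELECT name, age FROM people WHERE age >= 21")
sqlContext.sql("SELECT count(*) FROM adults")   // subsequent queries read the cached result
{code}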

> Add CACHE TABLE  AS SELECT ...
> 
>
> Key: SPARK-2594
> URL: https://issues.apache.org/jira/browse/SPARK-2594
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453
 ] 

Guoqiang Li edited comment on SPARK-1405 at 9/1/14 2:25 PM:


Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is OK to review. 


was (Author: gq):
Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. 

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453
 ] 

Guoqiang Li commented on SPARK-1405:


Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. 

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3191) Add explanation of support building spark with maven in http proxy environment

2014-09-01 Thread zhengbing li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117400#comment-14117400
 ] 

zhengbing li commented on SPARK-3191:
-

Another way to solve this issue is to set an environment variable:
export MAVEN_OPTS="-Dmaven.wagon.http.ssl.insecure=true 
-Dmaven.wagon.http.ssl.allowall=true"

> Add explanation of support building spark with maven in http proxy environment
> --
>
> Key: SPARK-3191
> URL: https://issues.apache.org/jira/browse/SPARK-3191
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0.2
> Environment: linux suse 11
> maven version:apache-maven-3.0.5
> spark version: 1.0.1
> proxy setting of maven is:
>   
>   lzb
>   true
>   http
>   user
>   password
>   proxy.company.com
>   8080
>   *.company.com
> 
>Reporter: zhengbing li
>Priority: Trivial
>  Labels: build, maven
> Fix For: 1.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When I use "mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean 
> package" in http proxy enviroment, I cannot finish this task.Error is as 
> follows:
> [INFO] Spark Project YARN Stable API . SUCCESS [34.217s]
> [INFO] Spark Project Assembly  FAILURE [43.133s]
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 27:57.309s
> [INFO] Finished at: Sat Aug 23 09:43:21 CST 2014
> [INFO] Final Memory: 51M/1080M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
> spark-assembly_2.10: Execution default of goal 
> org.apache.maven.plugins:maven-shade-plugin:2.2:shade failed: Plugin 
> org.apache.maven.plugins:maven-shade-plugin:2.2 or one of its dependencies 
> could not be resolved: Could not find artifact 
> com.google.code.findbugs:jsr305:jar:1.3.9 -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :spark-assembly_2.10
> If you use this command, it is OK:
> mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
> -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true 
> -DskipTests clean package
> The error message is not very obvious; I spent a long time solving this issue.
> To help others who build Spark behind an HTTP proxy, I highly recommend 
> adding this to the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117381#comment-14117381
 ] 

Apache Spark commented on SPARK-2096:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2230

> Correctly parse dot notations for accessing an array of structs
> ---
>
> Key: SPARK-2096
> URL: https://issues.apache.org/jira/browse/SPARK-2096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yin Huai
>Priority: Minor
>  Labels: starter
>
> For example, "arrayOfStruct" is an array of structs and every element of this 
> array has a field called "field1". "arrayOfStruct[0].field1" means to access 
> the value of "field1" for the first element of "arrayOfStruct", but the SQL 
> parser (in sql-core) treats "field1" as an alias. Also, 
> "arrayOfStruct.field1" means to access all values of "field1" in this array 
> of structs and return those values as an array, but the SQL parser 
> cannot resolve it.
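
As an illustration, a minimal Scala sketch of the two access patterns described
above, assuming a SQLContext and a hypothetical registered table "nestedData"
with an "arrayOfStruct" column; per the report, the sql-core parser currently
mishandles both queries:

  import org.apache.spark.sql.SQLContext

  def dotNotationExamples(sqlContext: SQLContext) = {
    // Should return "field1" of the first array element; the parser instead
    // treats "field1" as an alias.
    val firstElement = sqlContext.sql(
      "SELECT arrayOfStruct[0].field1 FROM nestedData")
    // Should return all "field1" values of the array, as an array per row;
    // the parser cannot resolve this at all.
    val allElements = sqlContext.sql(
      "SELECT arrayOfStruct.field1 FROM nestedData")
    (firstElement, allElements)
  }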



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117377#comment-14117377
 ] 

Apache Spark commented on SPARK-3337:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/2229

> Paranoid quoting in shell to allow install dirs with spaces within.
> ---
>
> Key: SPARK-3337
> URL: https://issues.apache.org/jira/browse/SPARK-3337
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3337:
---
Component/s: Build

> Paranoid quoting in shell to allow install dirs with spaces within.
> ---
>
> Key: SPARK-3337
> URL: https://issues.apache.org/jira/browse/SPARK-3337
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3337:
---
Component/s: Spark Core

> Paranoid quoting in shell to allow install dirs with spaces within.
> ---
>
> Key: SPARK-3337
> URL: https://issues.apache.org/jira/browse/SPARK-3337
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3337:
--

 Summary: Paranoid quoting in shell to allow install dirs with 
spaces within.
 Key: SPARK-3337
 URL: https://issues.apache.org/jira/browse/SPARK-3337
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.2, 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117361#comment-14117361
 ] 

Apache Spark commented on SPARK-3328:
-

User 'prudhvije' has created a pull request for this issue:
https://github.com/apache/spark/pull/2228

> ./make-distribution.sh --with-tachyon build is broken
> -
>
> Key: SPARK-3328
> URL: https://issues.apache.org/jira/browse/SPARK-3328
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Elijah Epifanov
>
> cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
> file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117350#comment-14117350
 ] 

Apache Spark commented on SPARK-3178:
-

User 'bbejeck' has created a pull request for this issue:
https://github.com/apache/spark/pull/2227

> setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
> worker memory limit to zero
> 
>
> Key: SPARK-3178
> URL: https://issues.apache.org/jira/browse/SPARK-3178
> Project: Spark
>  Issue Type: Bug
> Environment: osx
>Reporter: Jon Haddad
>Assignee: Bill Bejeck
>  Labels: starter
>
> This should either default to m or just completely fail.  Starting a worker 
> with zero memory isn't very helpful.
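
A minimal Scala sketch (not Spark's actual parsing code) of the behaviour the
report asks for: treat a bare number as megabytes, or fail loudly, rather than
silently ending up with zero:

  // Returns the worker memory limit in MB.
  def parseWorkerMemoryMb(setting: String): Int = {
    val s = setting.trim.toLowerCase
    if (s.endsWith("g")) s.dropRight(1).toInt * 1024   // gigabytes -> MB
    else if (s.endsWith("m")) s.dropRight(1).toInt     // already MB
    else s.toInt  // no suffix: assume MB here; alternatively throw an error
  }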



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread guowei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117308#comment-14117308
 ] 

guowei commented on SPARK-3292:
---

I've created PR 2192 to fix it, but I don't know how to link it.

> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> This affects operations such as repartition, groupBy, join, and cogroup.
> For example, if I want to save the shuffle outputs as Hadoop files, many empty 
> files are generated even though there is no input.
> That is too expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117299#comment-14117299
 ] 

Prashant Sharma commented on SPARK-3322:


So I felt SPARK-3171 is related. I am not sure I fully understand the 
intention; giving a bit more information on what this issue is about would be 
helpful. However, if you think this is essentially the same issue as SPARK-3171, 
feel free to close it.

> Log a ConnectionManager error when the application ends
> ---
>
> Key: SPARK-3322
> URL: https://issues.apache.org/jira/browse/SPARK-3322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
>
> Although it does not influence the result, an error from ConnectionManager is 
> always logged when the application ends.
> Sometimes only "ConnectionManagerId(vm2,40992) not found" is logged, and 
> sometimes a "CancelledKeyException" is logged as well.
> The log output is as follows:
> 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to 
> ConnectionManagerId(vm2,40992) not found
> 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? 
> sun.nio.ch.SelectionKeyImpl@457245f9
> java.nio.channels.CancelledKeyException
> at 
> org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
> at 
> org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3071) Increase default driver memory

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117241#comment-14117241
 ] 

Prashant Sharma commented on SPARK-3071:


This makes sense, but looking at it from a different angle: when the default 
memory is set "too" low, the user is forced to realize it must be configurable 
somewhere. If you increase the default, most things will work without OOMs, and 
the change takes effect cluster-wide, so users may never need to look at the 
setting at all. (This may be too critical a view, but it is another angle.)

> Increase default driver memory
> --
>
> Key: SPARK-3071
> URL: https://issues.apache.org/jira/browse/SPARK-3071
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>
> The current default is 512M, which is usually too small because the user also 
> uses the driver to do some computation. In local mode, the executor memory 
> setting is ignored and only driver memory is used, which provides more incentive 
> to increase the default driver memory.
> I suggest:
> 1. 2GB in local mode, and warn users if executor memory is set to a bigger value
> 2. the same as worker memory on an EC2 standalone server



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3275) Socket receiver can not recover when the socket server restarted

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117216#comment-14117216
 ] 

Prashant Sharma commented on SPARK-3275:


AFAIK, from a conversation with TD: Spark Streaming has a listener interface 
which lets you listen for receiver-failure events, so you can then manually 
restart the receiver or start another one somewhere. 
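
A minimal Scala sketch of that approach, assuming the StreamingListener hooks
available around Spark 1.1; the restart logic itself is omitted and this is not
a fix for the reconnect bug:

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError}

  class ReceiverErrorListener extends StreamingListener {
    override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
      val info = receiverError.receiverInfo
      // An application could alert here, or trigger its own restart logic.
      println(s"Receiver for stream ${info.streamId} reported an error: $info")
    }
  }

  // With an existing StreamingContext ssc:
  // ssc.addStreamingListener(new ReceiverErrorListener)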

> Socket receiver can not recover when the socket server restarted 
> -
>
> Key: SPARK-3275
> URL: https://issues.apache.org/jira/browse/SPARK-3275
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Jack Hu
>  Labels: failover
>
> To reproduce this issue:
> 1. create a application with a socket dstream
> 2. start the socket server and start the application
> 3. restart the socket server
> 4. the socket dstream will fail to reconnect (it will close the connection 
> after a successful connect)
> The main issue should be the status in SocketReceiver and ReceiverSupervisor 
> is incorrect after the reconnect:
> In SocketReceiver.receive() the while loop will never be entered after the 
> reconnect, since isStopped returns true:
>  val iterator = bytesToObjects(socket.getInputStream())
>   while(!isStopped && iterator.hasNext) {
> store(iterator.next)
>   }
>   logInfo("Stopped receiving")
>   restart("Retrying connecting to " + host + ":" + port)
> That is because the status flag "receiverState" in ReceiverSupervisor is set to 
> Stopped when the connection is lost, but it is only reset after the call to the 
> receiver's start method:
> def startReceiver(): Unit = synchronized {
> try {
>   logInfo("Starting receiver")
>   receiver.onStart()
>   logInfo("Called receiver onStart")
>   onReceiverStart()
>   receiverState = Started
> } catch {
>   case t: Throwable =>
> stop("Error starting receiver " + streamId, Some(t))
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3282) It should support multiple receivers at one socketInputDStream

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117208#comment-14117208
 ] 

Prashant Sharma commented on SPARK-3282:


You can create as many receivers as you like by creating multiple dstreams and 
doing a union on them. There are examples illustrating this, such as: 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RawNetworkGrep.scala.
There you can specify how many dstreams to create, and the same number of 
receivers is created.
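
A minimal Scala sketch of that pattern, with the host, port, and receiver count
as placeholders: each socketTextStream call gets its own receiver, and the union
merges them into a single DStream:

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  def unionedSocketStream(ssc: StreamingContext, host: String, port: Int,
      numReceivers: Int): DStream[String] = {
    // One receiver per socketTextStream; union them into a single stream.
    val streams = (1 to numReceivers).map(_ => ssc.socketTextStream(host, port))
    ssc.union(streams)
  }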

> It should support multiple receivers at one socketInputDStream 
> ---
>
> Key: SPARK-3282
> URL: https://issues.apache.org/jira/browse/SPARK-3282
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: shenhong
>
> At present, a socketInputDStream supports at most one receiver, which becomes a 
> bottleneck when a large input stream appears.
> It should support multiple receivers on one socketInputDStream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kay feng updated SPARK-3335:

Description: 
Running pyspark on a Spark cluster with a standalone master, Spark SQL cannot use 
broadcast variables in a UDF. But we can use broadcast variables in Spark in Scala.

For example,
bar={"a":"aa", "b":"bb", "c":"abc"}
foo=sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
q = sqlContext.sql('SELECT MYUDF(c) FROM foobar')
out = q.collect()

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaFo

[jira] [Created] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-01 Thread kay feng (JIRA)
kay feng created SPARK-3336:
---

 Summary: [Spark SQL] In pyspark, cannot group by field on UDF
 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng


Running pyspark on a Spark cluster with a standalone master, we cannot group by 
the result of a UDF. But we can group by a UDF in Scala.

For example:
q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY MYUDF(foo)')
out = q.collect()

I got this exception:
Py4JJavaError: An error occurred while calling o183.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 
(TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF#1278

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.immutable.List.foreach(List.scala:318)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(Projection.scala:52)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.(Aggregate.scala:176)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.

[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kay feng updated SPARK-3335:

Description: 
Running pyspark on a Spark cluster with a standalone master, Spark SQL cannot use 
broadcast variables in a UDF.

For example,
bar={"a":"aa", "b":"bb", "c":"abc"}
foo=sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
q = sqlContext.sql('SELECT MYUDF(c) FROM foobar')
out = q.collect()

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at sca

[jira] [Created] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)
kay feng created SPARK-3335:
---

 Summary: [Spark SQL] In pyspark, cannot use broadcast variables in 
UDF 
 Key: SPARK-3335
 URL: https://issues.apache.org/jira/browse/SPARK-3335
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng


Running pyspark on a Spark cluster with a standalone master, Spark SQL cannot use 
broadcast variables in a UDF.

For example,
bar={"a":"aa", "b":"bb", "c":"abc"}
foo=sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), , (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.sc

[jira] [Comment Edited] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156
 ] 

Prashant Sharma edited comment on SPARK-3292 at 9/1/14 8:12 AM:


This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save for streaming jobs. I will try to recollect the PR 
number, but in the meantime it would be great if you could link it. 


was (Author: prashant_):
This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save. I will try to recollect the PR number, but in the 
meantime it would be great if you could link it. 

> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> This affects operations such as repartition, groupBy, join, and cogroup.
> For example, if I want to save the shuffle outputs as Hadoop files, many empty 
> files are generated even though there is no input.
> That is too expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156
 ] 

Prashant Sharma commented on SPARK-3292:


This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save. I will try to recollect the PR number, but in the 
meantime it would be great if you could link it. 

> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> This affects operations such as repartition, groupBy, join, and cogroup.
> For example, if I want to save the shuffle outputs as Hadoop files, many empty 
> files are generated even though there is no input.
> That is too expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3007) Add "Dynamic Partition" support to Spark Sql hive

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117152#comment-14117152
 ] 

Apache Spark commented on SPARK-3007:
-

User 'baishuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2226

> Add "Dynamic Partition" support  to  Spark Sql hive
> ---
>
> Key: SPARK-3007
> URL: https://issues.apache.org/jira/browse/SPARK-3007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3306) Addition of external resource dependency in executors

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3306:
---
Target Version/s:   (was: 1.3.0)

> Addition of external resource dependency in executors
> -
>
> Key: SPARK-3306
> URL: https://issues.apache.org/jira/browse/SPARK-3306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Yan
>
> Currently, Spark executors only support static and read-only external 
> resources such as side files and jar files. With emerging disparate data sources, 
> there is a need to support more versatile external resources, such as 
> connections to data sources, to facilitate efficient data access to those 
> sources. For one, the JDBCRDD, with some modifications, could benefit from 
> this feature by reusing JDBC connections previously established in the same 
> Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117138#comment-14117138
 ] 

Prashant Sharma commented on SPARK-3306:


I am not 100% sure why you are referring to Spark executors. My feeling is that 
there is no need to touch the executors to support something like that. You can 
probably add a new Spark conf option; by default, all Spark conf options are 
propagated to the executors. 

I will let you close this issue if you are convinced.
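
A minimal Scala sketch of that suggestion, with a hypothetical conf key
("spark.myapp.jdbc.url") and no connection cleanup or pooling: the driver puts a
resource location into SparkConf, and each executor partition opens a single
connection rather than one per record:

  import java.sql.DriverManager
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  def lookupAll(sc: SparkContext, ids: RDD[Int]): RDD[String] = {
    // Hypothetical option set on the driver, e.g.
    // new SparkConf().set("spark.myapp.jdbc.url", "jdbc:...").
    val url = sc.getConf.get("spark.myapp.jdbc.url")
    ids.mapPartitions { iter =>
      // One connection per partition; a real implementation would also close
      // it once the partition's iterator is exhausted.
      val conn = DriverManager.getConnection(url)
      iter.map(id => s"looked up $id via ${conn.getMetaData.getURL}")
    }
  }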

> Addition of external resource dependency in executors
> -
>
> Key: SPARK-3306
> URL: https://issues.apache.org/jira/browse/SPARK-3306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Yan
>
> Currently, Spark executors only support static and read-only external 
> resources such as side files and jar files. With emerging disparate data sources, 
> there is a need to support more versatile external resources, such as 
> connections to data sources, to facilitate efficient data access to those 
> sources. For one, the JDBCRDD, with some modifications, could benefit from 
> this feature by reusing JDBC connections previously established in the same 
> Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117117#comment-14117117
 ] 

Prashant Sharma commented on SPARK-3322:


I remember seeing this discussion somewhere else as well, probably on a pull 
request. It seems the PR is somehow not linked to this issue; it would be great 
if you could do that. 

> Log a ConnectionManager error when the application ends
> ---
>
> Key: SPARK-3322
> URL: https://issues.apache.org/jira/browse/SPARK-3322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
>
> Although it does not influence the result, an error from ConnectionManager is 
> always logged when the application ends.
> Sometimes only "ConnectionManagerId(vm2,40992) not found" is logged, and 
> sometimes a "CancelledKeyException" is logged as well.
> The log output is as follows:
> 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to 
> ConnectionManagerId(vm2,40992) not found
> 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? 
> sun.nio.ch.SelectionKeyImpl@457245f9
> java.nio.channels.CancelledKeyException
> at 
> org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
> at 
> org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3328:
---
Summary: ./make-distribution.sh --with-tachyon build is broken  (was: 
--with-tachyon build is broken)

> ./make-distribution.sh --with-tachyon build is broken
> -
>
> Key: SPARK-3328
> URL: https://issues.apache.org/jira/browse/SPARK-3328
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Elijah Epifanov
>
> cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
> file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org