[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117062#comment-14117062
 ] 

Josh Rosen commented on SPARK-3333:
---

[~shivaram] and I discussed this; we have a few ideas about what might be 
happening.

I tried running {{sc.parallelize(1 to 
10).repartition(24000).keyBy(x => x).reduceByKey(_ + _).collect()}} in 1.0.2 and 
observed similarly slow speed to what I saw in the current 1.1.0 release 
candidate.  When I modified my job to use fewer reducers ({{reduceByKey(_ + _, 
4)}}), the job completed quickly.  You can see similar behavior in Python 
by explicitly specifying a smaller number of reducers.
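
For reference, a minimal Scala sketch of that workaround (the explicit reducer 
count is the only change from the slow job above; reading 
{{spark.default.parallelism}} as the knob that governs the default is my 
interpretation, not part of the original report):

{code}
// Same job as above, but cap the shuffle at 4 reduce partitions
// instead of inheriting the 24000 map-side partitions.
sc.parallelize(1 to 10)
  .repartition(24000)
  .keyBy(x => x)
  .reduceByKey(_ + _, 4)   // explicit reducer count
  .collect()

// Alternatively, lower the default used when no explicit count is passed:
val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "4")
{code}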

I think the issue here is that the overhead of sending and processing task 
completions is O(numReducers).  Specifically, the uncompressed 
size of ShuffleMapTask results is roughly O(numReducers), and there's an 
O(numReducers) processing cost for task completions within DAGScheduler (since 
mapOutputLocations is O(numReducers)).
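
As a back-of-envelope on the repro above: 24000 map tasks each reporting a size 
for 24000 reduce partitions means the driver must handle on the order of 
24000 × 24000 ≈ 5.8 × 10^8 map-output size entries.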

This normally isn't a problem, but it can impact performance for jobs with 
large numbers of extremely small map tasks (like this job, where nearly all of 
the map tasks are effectively no-ops).  For larger tasks, this cost should be 
masked by larger overheads (such as task processing time).

I'm not sure where the OOM is coming from, but the slow performance that you're 
observing here is probably due to the new default number of reducers 
(https://github.com/apache/spark/pull/1138 exposed this in Python by changing 
its defaults to match Scala Spark).

As a result, I'm not sure that this is a regression from 1.0.2, since it 
behaves similarly for Scala jobs.  I think we already do some compression of 
the task results and there are probably other improvements that we can make to 
lower these overheads, but I think we should postpone that to 1.2.0.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  a = sc.parallelize(["Nick", "John", "Bob"])
  a = a.repartition(24000)
  a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 

[jira] [Updated] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3330:
---
Assignee: Sean Owen

 Successive test runs with different profiles fail SparkSubmitSuite
 --

 Key: SPARK-3330
 URL: https://issues.apache.org/jira/browse/SPARK-3330
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Assignee: Sean Owen

 Maven-based Jenkins builds have been failing for a while:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console
 One common cause is that on the second and subsequent runs of mvn clean 
 test, at least two assembly JARs will exist in assembly/target. Because 
 assembly is not a submodule of parent, mvn clean is not invoked for 
 assembly. The presence of two assembly jars causes spark-submit to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3330:
---
Assignee: (was: Sean Owen)

 Successive test runs with different profiles fail SparkSubmitSuite
 --

 Key: SPARK-3330
 URL: https://issues.apache.org/jira/browse/SPARK-3330
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for a while:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console
 One common cause is that on the second and subsequent runs of mvn clean 
 test, at least two assembly JARs will exist in assembly/target. Because 
 assembly is not a submodule of parent, mvn clean is not invoked for 
 assembly. The presence of two assembly jars causes spark-submit to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3328:
---
Summary: ./make-distribution.sh --with-tachyon build is broken  (was: 
--with-tachyon build is broken)

 ./make-distribution.sh --with-tachyon build is broken
 -

 Key: SPARK-3328
 URL: https://issues.apache.org/jira/browse/SPARK-3328
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Elijah Epifanov

 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
 file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3322) Log a ConnectionManager error when the application ends

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117117#comment-14117117
 ] 

Prashant Sharma commented on SPARK-3322:


I remember seeing this discussion somewhere else as well, probably on a pull 
request. It seems the PR is not linked to this issue somehow. It would be great 
if you could link it. 

 Log a ConnectionManager error when the application ends
 ---

 Key: SPARK-3322
 URL: https://issues.apache.org/jira/browse/SPARK-3322
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei

 Although it does not influence the result, it always logs an error from 
 ConnectionManager.
 Sometimes it only logs "ConnectionManagerId(vm2,40992) not found" and sometimes 
 it also logs a CancelledKeyException.
 The log is as follows:
 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to 
 ConnectionManagerId(vm2,40992) not found
 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? 
 sun.nio.ch.SelectionKeyImpl@457245f9
 java.nio.channels.CancelledKeyException
 at 
 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at 
 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117138#comment-14117138
 ] 

Prashant Sharma commented on SPARK-3306:


I am not 100% sure why you are referring to Spark executors; I don't think 
there is any need to touch the executors to support something like that. You 
can probably add a new Spark conf option, and by default all Spark conf options 
are propagated to executors. 

I will let you close this issue, if you are convinced.
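
To sketch what I mean (a hedged example; the conf key and the pooling helper 
are hypothetical, the point is only that keys starting with spark. ship to 
executors):

{code}
import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

// Driver side: a hypothetical option, propagated automatically
// because its name starts with "spark.".
val conf = new SparkConf()
  .setAppName("external-resource-demo")
  .set("spark.myapp.jdbc.url", "jdbc:postgresql://db.example.com/mydb")
val sc = new SparkContext(conf)

// Executor side: read the option back inside a task, e.g. to key a
// per-JVM connection pool (pooling helper omitted).
sc.parallelize(1 to 4).foreach { _ =>
  val url = SparkEnv.get.conf.get("spark.myapp.jdbc.url")
  // ConnectionPool.getOrCreate(url) ... (hypothetical)
}
{code}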

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static and read-only external 
 resources of side files and jar files. With emerging disparate data sources, 
 there is a need to support more versatile external resources, such as 
 connections to data sources, to facilitate efficient data accesses to the 
 sources. For one, the JDBCRDD, with some modifications, could benefit from 
 this feature by reusing JDBC connections previously established within the 
 same Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3306) Addition of external resource dependency in executors

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3306:
---
Target Version/s:   (was: 1.3.0)

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static and read-only external 
 resources of side files and jar files. With emerging disparate data sources, 
 there is a need to support more versatile external resources, such as 
 connections to data sources, to facilitate efficient data accesses to the 
 sources. For one, the JDBCRDD, with some modifications, could benefit from 
 this feature by reusing JDBC connections previously established within the 
 same Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156
 ] 

Prashant Sharma commented on SPARK-3292:


This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save. I will try to recollect the PR number. But in the 
meantime, it would be great if you could link it. 
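
In the meantime, a hedged sketch of a workaround for the reporter (guarding the 
save with {{take(1)}} as a cheap emptiness probe is my suggestion, not 
necessarily what that PR does):

{code}
// Only write shuffle output for batches that actually produced records.
dstream.foreachRDD { (rdd, time) =>
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile(s"hdfs:///out/batch-${time.milliseconds}")
  }
}
{code}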

 Shuffle Tasks run incessantly even though there's no inputs
 ---

 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei

 This affects operations such as repartition, groupBy, join and cogroup.
 For example, if I want to save the shuffle outputs as Hadoop files, many 
 empty files are generated even though there are no inputs.
 That is too expensive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117156#comment-14117156
 ] 

Prashant Sharma edited comment on SPARK-3292 at 9/1/14 8:12 AM:


This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save for streaming jobs. I will try to recollect the PR 
number. But in the meantime, it would be great if you could link it. 


was (Author: prashant_):
This issue reminds me of a pull request that was trying to not do anything if 
there is nothing to save. I will try to recollect the PR number. But in the 
meantime, would be great if you could link it. 

 Shuffle Tasks run incessantly even though there's no inputs
 ---

 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei

 This affects operations such as repartition, groupBy, join and cogroup.
 For example, if I want to save the shuffle outputs as Hadoop files, many 
 empty files are generated even though there are no inputs.
 That is too expensive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)
kay feng created SPARK-3335:
---

 Summary: [Spark SQL] In pyspark, cannot use broadcast variables in 
UDF 
 Key: SPARK-3335
 URL: https://issues.apache.org/jira/browse/SPARK-3335
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng


Running pyspark on a spark cluster with standalone master, it cannot use 
broadcast variables in UDF.

For example,
bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), <function 
_from_id at 0x35042a8>, (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at 

[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kay feng updated SPARK-3335:

Description: 
Running pyspark on a spark cluster with standalone master, it cannot use 
broadcast variables in UDF.

For example,
bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
q = sqlContext.sql('SELECT MYUDF(c) FROM foobar')
out = q.collect()

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), <function 
_from_id at 0x35042a8>, (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   

[jira] [Created] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-01 Thread kay feng (JIRA)
kay feng created SPARK-3336:
---

 Summary: [Spark SQL] In pyspark, cannot group by field on UDF
 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng


Running pyspark on a spark cluster with a standalone master, we cannot group by 
a field produced by a UDF. But we can group by a UDF in Scala.

For example:
q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY MYUDF(foo)')
out = q.collect()

I got this exception:
Py4JJavaError: An error occurred while calling o183.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 
(TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF#1278

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.immutable.List.foreach(List.scala:318)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:52)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.<init>(Aggregate.scala:176)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 

[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-01 Thread kay feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kay feng updated SPARK-3335:

Description: 
Running pyspark on a spark cluster with standalone master, spark sql cannot use 
broadcast variables in UDF. But we can use broadcast variables in Scala Spark.

For example,
bar = {"a": "aa", "b": "bb", "c": "abc"}
foo = sc.broadcast(bar)
sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
q = sqlContext.sql('SELECT MYUDF(c) FROM foobar')
out = q.collect()

Got the following exception:
Py4JJavaError: An error occurred while calling o169.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 
(TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 150, in 
_read_with_length
return self.loads(obj)
  File "/root/spark/python/pyspark/broadcast.py", line 41, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '21' not loaded!",), <function 
_from_id at 0x35042a8>, (21L,))

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)

org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 

[jira] [Commented] (SPARK-3282) It should support multiple receivers at one socketInputDStream

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117208#comment-14117208
 ] 

Prashant Sharma commented on SPARK-3282:


You can create as many receivers as you like by creating multiple DStreams and 
doing a union on them. There are examples illustrating this, like: 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RawNetworkGrep.scala.
 There you can specify how many DStreams to create, and the same number of 
receivers are created.
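
A minimal sketch of that pattern (the host, port and receiver count are 
placeholders):

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// One receiver per input DStream; union them into a single logical stream.
val streams = (1 to 4).map(_ => ssc.socketTextStream("host", 9999))
val unioned = ssc.union(streams)
unioned.count().print()
ssc.start()
{code}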

 It should support multiple receivers at one socketInputDStream 
 ---

 Key: SPARK-3282
 URL: https://issues.apache.org/jira/browse/SPARK-3282
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: shenhong

 At present, a SocketInputDStream supports at most one receiver; this will be a 
 bottleneck when a large input stream appears. 
 It should support multiple receivers per SocketInputDStream. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3275) Socket receiver can not recover when the socket server restarted

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117216#comment-14117216
 ] 

Prashant Sharma commented on SPARK-3275:


AFAIK, from my conversation with TD once: Spark Streaming has a listener 
interface which lets you listen for the event of a receiver dying, so you can 
then manually restart/start another receiver somewhere. 
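
A sketch of that hook, assuming the StreamingListener API of this era (ssc is 
your StreamingContext; the restart action is application-specific and left as 
comments):

{code}
import org.apache.spark.streaming.scheduler.{StreamingListener,
  StreamingListenerReceiverError, StreamingListenerReceiverStopped}

ssc.addStreamingListener(new StreamingListener {
  override def onReceiverError(err: StreamingListenerReceiverError): Unit = {
    // inspect the error, then restart or start another receiver
  }
  override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit = {
    // same idea for a clean stop
  }
})
{code}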

 Socket receiver can not recover when the socket server restarted 
 -

 Key: SPARK-3275
 URL: https://issues.apache.org/jira/browse/SPARK-3275
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Jack Hu
  Labels: failover

 To reproduce this issue:
 1. create a application with a socket dstream
 2. start the socket server and start the application
 3. restart the socket server
 4. the socket dstream will fail to reconnect (it will close the connection 
 after a successful connect)
 The main issue should be that the status in SocketReceiver and 
 ReceiverSupervisor is incorrect after the reconnect:
 In SocketReceiver.receive() the while loop will never be entered after 
 reconnect since isStopped returns true:
  val iterator = bytesToObjects(socket.getInputStream())
   while(!isStopped && iterator.hasNext) {
 store(iterator.next)
   }
   logInfo("Stopped receiving")
   restart("Retrying connecting to " + host + ":" + port)
 That is caused by the status flag receiverState in ReceiverSupervisor, which 
 is set to Stopped when the connection is lost but only reset after the call 
 to the receiver's start method:
 def startReceiver(): Unit = synchronized {
 try {
   logInfo("Starting receiver")
   receiver.onStart()
   logInfo("Called receiver onStart")
   onReceiverStart()
   receiverState = Started
 } catch {
   case t: Throwable =>
 stop("Error starting receiver " + streamId, Some(t))
 }
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3071) Increase default driver memory

2014-09-01 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117241#comment-14117241
 ] 

Prashant Sharma commented on SPARK-3071:


This makes sense. But looking at it from a different angle: when we set the 
default memory too low, the user is forced to think about where this should be 
configurable. However, if you increase the default, most things will work 
without OOMs, and this takes effect cluster-wide, so they may never need to 
look at it. (This is perhaps being too critical, but it's just another angle.)

 Increase default driver memory
 --

 Key: SPARK-3071
 URL: https://issues.apache.org/jira/browse/SPARK-3071
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Xiangrui Meng

 The current default is 512M, which is usually too small because the user also 
 uses the driver to do some computation. In local mode, the executor memory 
 setting is ignored while only driver memory is used, which provides more 
 incentive to increase the default driver memory.
 I suggest
 1. 2GB in local mode, and warn users if executor memory is set to a bigger value
 2. same as worker memory on an EC2 standalone server



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-01 Thread guowei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117308#comment-14117308
 ] 

guowei commented on SPARK-3292:
---

I've created PR 2192 to fix it, but I don't know how to link it.

 Shuffle Tasks run incessantly even though there's no inputs
 ---

 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei

 This affects operations such as repartition, groupBy, join and cogroup.
 For example, if I want to save the shuffle outputs as Hadoop files, many 
 empty files are generated even though there are no inputs.
 That is too expensive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117350#comment-14117350
 ] 

Apache Spark commented on SPARK-3178:
-

User 'bbejeck' has created a pull request for this issue:
https://github.com/apache/spark/pull/2227

 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
 worker memory limit to zero
 

 Key: SPARK-3178
 URL: https://issues.apache.org/jira/browse/SPARK-3178
 Project: Spark
  Issue Type: Bug
 Environment: osx
Reporter: Jon Haddad
Assignee: Bill Bejeck
  Labels: starter

 This should either default to m or just completely fail.  Starting a worker 
 with zero memory isn't very helpful.
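
For illustration, a hedged sketch of the strict parse this asks for (the 
function name and error behavior are invented, not Spark's actual code):

{code}
// Hypothetical strict parser: accept "512m" / "2g", reject unlabelled values
// instead of silently producing a zero-memory worker.
def parseWorkerMemoryMb(s: String): Int = s.trim.toLowerCase match {
  case v if v.endsWith("g") => v.dropRight(1).toInt * 1024
  case v if v.endsWith("m") => v.dropRight(1).toInt
  case v => sys.error(s"SPARK_WORKER_MEMORY needs an m or g suffix, got: $v")
}
{code}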



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117361#comment-14117361
 ] 

Apache Spark commented on SPARK-3328:
-

User 'prudhvije' has created a pull request for this issue:
https://github.com/apache/spark/pull/2228

 ./make-distribution.sh --with-tachyon build is broken
 -

 Key: SPARK-3328
 URL: https://issues.apache.org/jira/browse/SPARK-3328
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Elijah Epifanov

 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
 file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3337:
--

 Summary: Paranoid quoting in shell to allow install dirs with 
spaces within.
 Key: SPARK-3337
 URL: https://issues.apache.org/jira/browse/SPARK-3337
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.2, 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3337:
---
Component/s: Spark Core

 Paranoid quoting in shell to allow install dirs with spaces within.
 ---

 Key: SPARK-3337
 URL: https://issues.apache.org/jira/browse/SPARK-3337
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-09-01 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3337:
---
Component/s: Build

 Paranoid quoting in shell to allow install dirs with spaces within.
 ---

 Key: SPARK-3337
 URL: https://issues.apache.org/jira/browse/SPARK-3337
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117381#comment-14117381
 ] 

Apache Spark commented on SPARK-2096:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2230

 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter

 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and the returns those values as an array. But, the SQL parser 
 cannot resolve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3191) Add explanation of support building spark with maven in http proxy environment

2014-09-01 Thread zhengbing li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117400#comment-14117400
 ] 

zhengbing li commented on SPARK-3191:
-

Another way to solve this issue is to set an environment variable:
export MAVEN_OPTS="-Dmaven.wagon.http.ssl.insecure=true 
-Dmaven.wagon.http.ssl.allowall=true"

 Add explanation of support building spark with maven in http proxy environment
 --

 Key: SPARK-3191
 URL: https://issues.apache.org/jira/browse/SPARK-3191
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
 Environment: linux suse 11
 maven version:apache-maven-3.0.5
 spark version: 1.0.1
 proxy setting of maven is:
   <proxy>
     <id>lzb</id>
     <active>true</active>
     <protocol>http</protocol>
     <username>user</username>
     <password>password</password>
     <host>proxy.company.com</host>
     <port>8080</port>
     <nonProxyHosts>*.company.com</nonProxyHosts>
   </proxy>
Reporter: zhengbing li
Priority: Trivial
  Labels: build, maven
 Fix For: 1.1.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 When I use mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean 
 package in an http proxy environment, I cannot finish this task. The error is 
 as follows:
 [INFO] Spark Project YARN Stable API . SUCCESS [34.217s]
 [INFO] Spark Project Assembly  FAILURE [43.133s]
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 [INFO] 
 
 [INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 27:57.309s
 [INFO] Finished at: Sat Aug 23 09:43:21 CST 2014
 [INFO] Final Memory: 51M/1080M
 [INFO] 
 
 [ERROR] Failed to execute goal 
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
 spark-assembly_2.10: Execution default of goal 
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade failed: Plugin 
 org.apache.maven.plugins:maven-shade-plugin:2.2 or one of its dependencies 
 could not be resolved: Could not find artifact 
 com.google.code.findbugs:jsr305:jar:1.3.9 - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
 switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please 
 read the following articles:
 [ERROR] [Help 1] 
 http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the 
 command
 [ERROR]   mvn <goals> -rf :spark-assembly_2.10
 If you use this command, it works:
 mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
 -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true 
 -DskipTests clean package
 The error is not very obvious, and I spent a long time solving this issue.
 In order to help other people who use spark in an http proxy environment, I 
 highly recommend adding this to the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453
 ] 

Guoqiang Li commented on SPARK-1405:


Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. 

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.
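
For readers unfamiliar with the technique, here is a single-machine sketch of 
the collapsed Gibbs sampling update that such an implementation would 
parallelize (all names are illustrative, not the PR's API):

{code}
import scala.util.Random

// docs(d)(i) is the word id of token i in document d; K topics, V vocabulary words.
def gibbsLda(docs: Array[Array[Int]], K: Int, V: Int,
             alpha: Double = 0.1, beta: Double = 0.01, iters: Int = 100): Array[Array[Int]] = {
  val rand = new Random(42)
  val z = docs.map(_.map(_ => rand.nextInt(K)))   // topic assignment per token
  val nDK = Array.ofDim[Int](docs.length, K)      // doc-topic counts
  val nKW = Array.ofDim[Int](K, V)                // topic-word counts
  val nK  = new Array[Int](K)                     // topic totals
  for (d <- docs.indices; i <- docs(d).indices) {
    val k = z(d)(i); nDK(d)(k) += 1; nKW(k)(docs(d)(i)) += 1; nK(k) += 1
  }
  for (_ <- 1 to iters; d <- docs.indices; i <- docs(d).indices) {
    val w = docs(d)(i); val old = z(d)(i)
    nDK(d)(old) -= 1; nKW(old)(w) -= 1; nK(old) -= 1    // remove current assignment
    // p(k) is proportional to (nDK + alpha) * (nKW + beta) / (nK + V*beta)
    val p = Array.tabulate(K)(k => (nDK(d)(k) + alpha) * (nKW(k)(w) + beta) / (nK(k) + V * beta))
    var u = rand.nextDouble() * p.sum; var k = 0
    while (k < K - 1 && u > p(k)) { u -= p(k); k += 1 } // sample a new topic
    z(d)(i) = k; nDK(d)(k) += 1; nKW(k)(w) += 1; nK(k) += 1
  }
  z
}
{code}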



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117453#comment-14117453
 ] 

Guoqiang Li edited comment on SPARK-1405 at 9/1/14 2:25 PM:


Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is OK to review. 


was (Author: gq):
Hi all
[PR 1983|https://github.com/apache/spark/pull/1983] is ready to review. 

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-09-01 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117464#comment-14117464
 ] 

Ravindra Pesala commented on SPARK-2594:


Thank you Michael. Following are the tasks which I am planning to do to support 
this feature.
1. Change the SqlParser to support the CACHE TABLE name AS SELECT syntax. We 
can also change HiveQl to support the syntax for Hive.
2. Add a new strategy 'AddCacheTable' to 'SparkStrategies'.
3. In the new strategy 'AddCacheTable', register the tableName with the plan 
and cache it under tableName.

Please review it and comment.
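
For concreteness, usage would look roughly like this (the syntax is the 
proposal under discussion, not a shipped API):

{code}
// Proposed: cache the result of a query under a table name in one statement.
sqlContext.sql("CACHE TABLE errorLogs AS SELECT * FROM logs WHERE level = 'ERROR'")
// Later queries against the cached table are served from memory:
sqlContext.sql("SELECT COUNT(*) FROM errorLogs").collect()
{code}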

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558
 ] 

Ye Xianjin commented on SPARK-3098:
---

hi, [~srowen] and [~gq], I think what [~matei] wants to say is that because the 
ordering of elements in distinct() is not guaranteed, the result of 
zipWithIndex is not deterministic. If you recompute the RDD with the distinct 
transformation, you are not guaranteed to get the same result. That explains 
the behavior here.

But as [~srowen] said, it's surprising to see different results from the same 
RDD. [~matei], what do you think about this behavior?

  In some cases, operation zipWithIndex get a wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The reproduce code:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  => 
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true

2014-09-01 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117581#comment-14117581
 ] 

Timothy Chen commented on SPARK-2022:
-

This should be resolved now, [~pwend...@gmail.com] please help close this.

 Spark 1.0.0 is failing if mesos.coarse set to true
 --

 Key: SPARK-2022
 URL: https://issues.apache.org/jira/browse/SPARK-2022
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Marek Wiewiorka
Assignee: Tim Chen
Priority: Critical

 more stderr
 ---
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2
 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 
 201405220917-134217738-5050-27119-0
 Exception in thread "main" java.lang.NumberFormatException: For input string: 
 "sparkseq003.cloudapp.net"
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:492)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
 at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 more stdout
 ---
 Registered executor on sparkseq003.cloudapp.net
 Starting task 5
 Forked command at 61202
 sh -c '/home/mesos/spark-1.0.0/bin/spark-class 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 -Dspark.mesos.coarse=true 
 akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseGrainedScheduler 
 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 
 4'
 Command exited with status 1 (pid: 61202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, the zipWithIndex operation gets wrong results

2014-09-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622
 ] 

Matei Zaharia commented on SPARK-3098:
--

It's true that the ordering of values after a shuffle is nondeterministic, so 
that for example on failure you might get a different order of keys in a 
reduceByKey or distinct or operations like that. However, I think that's the 
way it should be (and we can document it). RDDs are deterministic when viewed 
as a multiset, but not when viewed as an ordered collection, unless you do 
sortByKey. Operations like zipWithIndex are meant to be more of a convenience 
to get unique IDs or act on something with a known ordering (such as a text 
file where you want to know the line numbers). But the freedom to control fetch 
ordering is quite important for performance, especially if you want to have a 
push-based shuffle in the future.

If we wanted to get the same result every time, we could design reduce tasks to 
tell the master the order they fetched stuff in after the first time they ran, 
but even then, notice that it might limit the kind of shuffle mechanisms we 
allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd 
rather not make that guarantee now.
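
If stable IDs across recomputations are needed despite this, one workaround is 
to impose an explicit ordering before zipWithIndex, so the assigned indices no 
longer depend on shuffle fetch order. A minimal Scala sketch, assuming a 
SparkContext named sc:

{code}
// distinct() shuffles, so the fetch order (and hence the indices that
// zipWithIndex assigns) can differ between evaluations of the same RDD.
// Sorting first pins the order down, at the cost of an extra shuffle.
val ids = sc.parallelize(Seq(3, 1, 2, 3, 1))
  .distinct()
  .sortBy(identity)   // deterministic order before indexing
  .zipWithIndex()

ids.collect()  // always Array((1,0), (2,1), (3,2))
{code}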

  In some cases, the zipWithIndex operation gets wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 Code to reproduce:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  => 
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2312:
--
Assignee: Josh Rosen

 Spark Actors do not handle unknown messages in their receive methods
 

 Key: SPARK-2312
 URL: https://issues.apache.org/jira/browse/SPARK-2312
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kam Kasravi
Assignee: Josh Rosen
Priority: Minor
  Labels: starter
 Fix For: 1.1.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Per the akka documentation, an actor should provide a pattern match for all 
 messages, including _, otherwise akka.actor.UnhandledMessage will be 
 propagated. 
 Noted actors:
 MapOutputTrackerMasterActor, ClientActor, Master, Worker...
 Should minimally do a 
 logWarning(s"Received unexpected actor system event: $_") so message info is 
 logged in the correct actor.
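
 A minimal sketch of such a catch-all (SafeActor and KnownMessage are 
 hypothetical names, not actors from Spark's code):
 {code}
 import akka.actor.Actor
 import akka.event.Logging

 case object KnownMessage  // hypothetical expected message

 class SafeActor extends Actor {
   val log = Logging(context.system, this)
   def receive = {
     case KnownMessage =>
       // handle the expected message here
     case unknown =>
       // catch-all: log unexpected messages in the receiving actor
       // instead of letting akka.actor.UnhandledMessage propagate
       log.warning("Received unexpected actor system event: " + unknown)
   }
 }
 {code}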



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2312:
--
Assignee: Isaias Barroso  (was: Josh Rosen)

 Spark Actors do not handle unknown messages in their receive methods
 

 Key: SPARK-2312
 URL: https://issues.apache.org/jira/browse/SPARK-2312
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kam Kasravi
Assignee: Isaias Barroso
Priority: Minor
  Labels: starter
 Fix For: 1.1.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Per the akka documentation, an actor should provide a pattern match for all 
 messages, including _, otherwise akka.actor.UnhandledMessage will be 
 propagated. 
 Noted actors:
 MapOutputTrackerMasterActor, ClientActor, Master, Worker...
 Should minimally do a 
 logWarning(s"Received unexpected actor system event: $_") so message info is 
 logged in the correct actor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3338:


 Summary: Respect user setting of spark.submit.pyFiles
 Key: SPARK-3338
 URL: https://issues.apache.org/jira/browse/SPARK-3338
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or


We currently override any setting of spark.submit.pyFiles. Even though this is 
not documented, we should still respect it if the user explicitly sets it 
in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3338:
-
Component/s: (was: PySpark)

 Respect user setting of spark.submit.pyFiles
 

 Key: SPARK-3338
 URL: https://issues.apache.org/jira/browse/SPARK-3338
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or

 We currently override any setting of spark.submit.pyFiles. Even though this 
 is not documented, we should still respect it if the user explicitly sets 
 it in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770
 ] 

Patrick Wendell commented on SPARK-3333:


[~nchammas], I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? What happens if you run the code in Spark 1.1.0 and specify a 
number of reducers that matches the number chosen in 1.0.2... is the 
performance the same?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread "Result resolver thread-3" 

[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117770#comment-14117770
 ] 

Patrick Wendell edited comment on SPARK-3333 at 9/1/14 10:51 PM:
-

[~nchammas], I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? What happens if you run the code in Spark 1.1.0 and specify a 
number of reducers that matches the number chosen in 1.0.2... is the 
performance the same?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.


was (Author: pwendell):
[~nchammas], I think the default number of reducers could be the culprit. If 
you look at the actual job run in Spark 1.0.2, how many reducers are there in 
the shuffle? How many are there? What happens if you run the code in Spark 
1.1.0 and specify a number of reducers that matches the number chosen in 
1.0.2... is the performance the same?

If this is the case we should probably document this clearly in the release 
notes, since changing this default could have major implications for those 
relying on the default behavior.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 

[jira] [Created] (SPARK-3339) Support for skipping json lines that fail to parse

2014-09-01 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3339:
---

 Summary: Support for skipping json lines that fail to parse
 Key: SPARK-3339
 URL: https://issues.apache.org/jira/browse/SPARK-3339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


When dealing with large datasets there is always some data that fails to parse.  
It would be nice to handle this instead of throwing an exception and requiring 
the user to filter it out manually.
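
A hedged sketch of the desired behavior, using the Scala standard library's 
JSON parser and a hypothetical input path rather than Spark SQL's actual json 
support:

{code}
import scala.util.parsing.json.JSON

// Assumes a SparkContext named sc. JSON.parseFull returns None for lines
// that fail to parse, so flatMap silently skips them.
val lines = sc.textFile("data.json")   // hypothetical path
val parsed = lines.flatMap(line => JSON.parseFull(line))
{code}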



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3339) Support for skipping json lines that fail to parse

2014-09-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3339:

Assignee: Yin Huai

 Support for skipping json lines that fail to parse
 --

 Key: SPARK-3339
 URL: https://issues.apache.org/jira/browse/SPARK-3339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai

 When dealing with large datasets there is always some data that fails to 
 parse.  It would be nice to handle this instead of throwing an exception 
 and requiring the user to filter it out manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3308) Ability to read JSON Arrays as tables

2014-09-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3308:

Assignee: Yin Huai

 Ability to read JSON Arrays as tables
 -

 Key: SPARK-3308
 URL: https://issues.apache.org/jira/browse/SPARK-3308
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Critical

 Right now we can only read json where each object is on its own line.  It 
 would be nice to be able to read top level json arrays where each element 
 maps to a row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3319) Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.

2014-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117775#comment-14117775
 ] 

Apache Spark commented on SPARK-3319:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2232

 Resolve spark.jars, spark.files, and spark.submit.pyFiles etc.
 --

 Key: SPARK-3319
 URL: https://issues.apache.org/jira/browse/SPARK-3319
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or

 We already do this for --jars, --files, and --py-files etc. For consistency, 
 we should do the same for the corresponding spark configs as well in case the 
 user sets them in spark-defaults.conf instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES

2014-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3340:


 Summary: Deprecate ADD_JARS and ADD_FILES
 Key: SPARK-3340
 URL: https://issues.apache.org/jira/browse/SPARK-3340
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or


These were introduced before Spark submit even existed. Now that there are many 
better ways of setting jars and python files through Spark submit, we should 
deprecate these environment variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-09-01 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117801#comment-14117801
 ] 

Andrew Or commented on SPARK-3338:
--

https://github.com/apache/spark/pull/2232

 Respect user setting of spark.submit.pyFiles
 

 Key: SPARK-3338
 URL: https://issues.apache.org/jira/browse/SPARK-3338
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or

 We currently override any setting of spark.submit.pyFiles. Even though this 
 is not documented, we should still respect it if the user explicitly sets 
 it in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2638) Improve concurrency of fetching Map outputs

2014-09-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-2638:
-

Assignee: Josh Rosen

 Improve concurrency of fetching Map outputs
 ---

 Key: SPARK-2638
 URL: https://issues.apache.org/jira/browse/SPARK-2638
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: All
Reporter: Stephen Boesch
Assignee: Josh Rosen
Priority: Minor
  Labels: MapOutput, concurrency
 Fix For: 1.1.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 This issue was noticed while perusing the MapOutputTracker source code. 
 Notice that the synchronization is on the containing fetching collection, 
 which makes ALL fetches wait if any fetch is occurring.  
 The fix is to synchronize instead on the shuffleId (interned as a string to 
 ensure JVM-wide visibility).
   def getServerStatuses(shuffleId: Int, reduceId: Int): 
 Array[(BlockManagerId, Long)] = {
     val statuses = mapStatuses.get(shuffleId).orNull
     if (statuses == null) {
       logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching 
 them")
       var fetchedStatuses: Array[MapStatus] = null
       fetching.synchronized {   // This is existing code
       // shuffleId.toString.intern.synchronized {  // New Code
         if (fetching.contains(shuffleId)) {
           // Someone else is fetching it; wait for them to be done
           while (fetching.contains(shuffleId)) {
             try {
               fetching.wait()
             } catch {
               case e: InterruptedException =>
             }
           }
         }
 This is only a small code change, but the test cases to prove (a) proper 
 functionality and (b) proper performance improvement are not so trivial.  
 For (b) it is not worthwhile to add a test case to the codebase. Instead I 
 have added a git project that demonstrates the concurrency/performance 
 improvement using the fine-grained approach. The GitHub project is at 
 https://github.com/javadba/scalatesting.git. Simply run "sbt test". Note: 
 it is unclear how/where to include this ancillary testing/verification 
 information that will not be included in the git PR; I am open to any 
 suggestions, even as far as simply removing references to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1174) Adding port configuration for HttpFileServer

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1174.

Resolution: Duplicate

 Adding port configuration for HttpFileServer
 

 Key: SPARK-1174
 URL: https://issues.apache.org/jira/browse/SPARK-1174
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Egor Pakhomov
Assignee: Egor Pahomov
Priority: Minor
 Fix For: 0.9.0


 I run Spark in a big organization, where opening a port to other computers 
 on the network requires a DevOps ticket that can take days to execute. I 
 can't have the port for some Spark service change all the time; I need the 
 ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API

2014-09-01 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2895:
-
Description: 
This is a requirement from Hive on Spark: mapPartitionsWithContext only exists 
in the Spark Scala API; we expect to access it from the Spark Java API. 
For HIVE-7627 and HIVE-7843, Hive operators that are invoked in the 
mapPartitions closure need to get the taskId.

  was:This is a requirement from Hive on Spark: mapPartitionsWithContext only 
exists in the Spark Scala API; we expect to access it from the Spark Java API.


 Support mapPartitionsWithContext in Spark Java API
 --

 Key: SPARK-2895
 URL: https://issues.apache.org/jira/browse/SPARK-2895
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
  Labels: hive

 This is a requirement from Hive on Spark: mapPartitionsWithContext only 
 exists in the Spark Scala API; we expect to access it from the Spark Java 
 API. For HIVE-7627 and HIVE-7843, Hive operators that are invoked in the 
 mapPartitions closure need to get the taskId.
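
 A hedged sketch of the Scala-side usage being referenced (the exact 
 TaskContext accessors and the rdd value are assumptions, not verified 
 against a particular release):
 {code}
 // Tag each record with the id of the partition (task) that produced it.
 val withTaskIds = rdd.mapPartitionsWithContext { (ctx, iter) =>
   iter.map(record => (ctx.partitionId, record))
 }
 {code}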



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2078) Use ISO8601 date formats in logging

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2078.

Resolution: Won't Fix

Discussion on the JIRA suggested that we either keep the current format or 
just add milliseconds.

https://github.com/apache/spark/pull/1018

 Use ISO8601 date formats in logging
 ---

 Key: SPARK-2078
 URL: https://issues.apache.org/jira/browse/SPARK-2078
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Ash
Assignee: Andrew Ash

 Currently, logging has 2-digit years and doesn't include milliseconds in 
 logging timestamps.
 Use ISO8601 date formats instead of the current custom formats.
 There is some precedent here for ISO8601 format -- it's what [Hadoop 
 uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties]
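
 A log4j.properties sketch of what that could look like (the appender name is 
 illustrative; %d{ISO8601} is a standard log4j conversion pattern):
 {code}
 # Prints timestamps like 2014-06-09 17:54:25,123
 log4j.rootCategory=INFO, console
 log4j.appender.console=org.apache.log4j.ConsoleAppender
 log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c{1}: %m%n
 {code}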



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2078) Use ISO8601 date formats in logging

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2078:
---
Issue Type: Improvement  (was: Bug)

 Use ISO8601 date formats in logging
 ---

 Key: SPARK-2078
 URL: https://issues.apache.org/jira/browse/SPARK-2078
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Ash
Assignee: Andrew Ash

 Currently, logging has 2-digit years and doesn't include milliseconds in 
 logging timestamps.
 Use ISO8601 date formats instead of the current custom formats.
 There is some precedent here for ISO8601 format -- it's what [Hadoop 
 uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3179) Add task OutputMetrics

2014-09-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117825#comment-14117825
 ] 

Sandy Ryza commented on SPARK-3179:
---

Hi Michael,

Happy to help review your code or answer questions.

 Add task OutputMetrics
 --

 Key: SPARK-3179
 URL: https://issues.apache.org/jira/browse/SPARK-3179
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza

 Track the bytes that tasks write to HDFS or other output destinations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2773) Shuffle: use growth rate to predict whether to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2773.

Resolution: Invalid

 Shuffle: use growth rate to predict whether to spill
 ---

 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 0.9.0, 1.0.0
Reporter: uncleGen
Priority: Minor

 Right now, Spark uses the total shuffle-memory usage of each thread to 
 predict whether it needs to spill. I think that is not very reasonable. For 
 example, suppose there are two threads pulling shuffle data and the total 
 memory available to buffer data is 21G. The first spill is triggered when 
 one thread has used 7G of memory to buffer shuffle data (here I assume the 
 other one has used the same amount). Unfortunately, there is still 7G 
 remaining. So I think the current prediction mode is too conservative and 
 cannot maximize the usage of shuffle memory. In my solution, I use the 
 growth rate of shuffle memory. The growth per check is limited, maybe 10K * 
 1024 (my assumption), so the first spill is triggered when the remaining 
 shuffle memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
 this maximizes the usage of shuffle memory. My solution also makes a 
 conservative assumption, i.e. that all threads are pulling shuffle data in 
 one executor; however, that does not have much effect, since the growth is 
 limited after all. Any suggestions?
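
 Read literally, the proposed trigger works out as follows (an illustrative 
 Scala sketch with the example numbers above; all names are made up, this is 
 not Spark code):
 {code}
 val numThreads = 2
 val growthPerCheck = 10240L * 1024   // the "10K * 1024" figure above, ~10 MB

 // Spill when the remaining shuffle memory could be exhausted within
 // roughly two more checks by all threads growing at once: here ~40 MB.
 def shouldSpill(remainingShuffleMemory: Long): Boolean =
   remainingShuffleMemory < numThreads * growthPerCheck * 2
 {code}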



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2773) Shuffle: use growth rate to predict whether to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2773.

Resolution: Won't Fix

I don't think this is needed now that SPARK-2316 is fixed. This queue is not 
intended to overflow during normal operation. If you still observe issues in a 
version of Spark that contains SPARK-2316... please report it and we'll see 
what is going on.

 Shuffle: use growth rate to predict whether to spill
 ---

 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 0.9.0, 1.0.0
Reporter: uncleGen
Priority: Minor

 Right now, Spark uses the total shuffle-memory usage of each thread to 
 predict whether it needs to spill. I think that is not very reasonable. For 
 example, suppose there are two threads pulling shuffle data and the total 
 memory available to buffer data is 21G. The first spill is triggered when 
 one thread has used 7G of memory to buffer shuffle data (here I assume the 
 other one has used the same amount). Unfortunately, there is still 7G 
 remaining. So I think the current prediction mode is too conservative and 
 cannot maximize the usage of shuffle memory. In my solution, I use the 
 growth rate of shuffle memory. The growth per check is limited, maybe 10K * 
 1024 (my assumption), so the first spill is triggered when the remaining 
 shuffle memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
 this maximizes the usage of shuffle memory. My solution also makes a 
 conservative assumption, i.e. that all threads are pulling shuffle data in 
 one executor; however, that does not have much effect, since the growth is 
 limited after all. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2773) Shuffle: use growth rate to predict whether to spill

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-2773:


 Shuffle: use growth rate to predict whether to spill
 ---

 Key: SPARK-2773
 URL: https://issues.apache.org/jira/browse/SPARK-2773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 0.9.0, 1.0.0
Reporter: uncleGen
Priority: Minor

 Right now, Spark uses the total shuffle-memory usage of each thread to 
 predict whether it needs to spill. I think that is not very reasonable. For 
 example, suppose there are two threads pulling shuffle data and the total 
 memory available to buffer data is 21G. The first spill is triggered when 
 one thread has used 7G of memory to buffer shuffle data (here I assume the 
 other one has used the same amount). Unfortunately, there is still 7G 
 remaining. So I think the current prediction mode is too conservative and 
 cannot maximize the usage of shuffle memory. In my solution, I use the 
 growth rate of shuffle memory. The growth per check is limited, maybe 10K * 
 1024 (my assumption), so the first spill is triggered when the remaining 
 shuffle memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
 this maximizes the usage of shuffle memory. My solution also makes a 
 conservative assumption, i.e. that all threads are pulling shuffle data in 
 one executor; however, that does not have much effect, since the growth is 
 limited after all. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3067:
---
Target Version/s: 1.2.0

 JobProgressPage could not show Fair Scheduler Pools section sometimes
 -

 Key: SPARK-3067
 URL: https://issues.apache.org/jira/browse/SPARK-3067
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor

 JobProgressPage sometimes does not show the Fair Scheduler Pools section.
 SparkContext starts the web UI and then calls postEnvironmentUpdate. If 
 JobProgressPage is accessed between the web UI starting and 
 postEnvironmentUpdate, the lazy val isFairScheduler is evaluated to false 
 and cached, so the Fair Scheduler Pools section never displays again.
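
 A minimal sketch of the failure mode (names illustrative, not the actual 
 web UI code): a lazy val caches its first evaluation, so reading it before 
 the environment update permanently yields false.
 {code}
 object LazyValRace extends App {
   var envPosted = false                  // stands in for postEnvironmentUpdate
   lazy val isFairScheduler = envPosted   // computed and cached on first access

   println(isFairScheduler)  // accessed too early: false is cached
   envPosted = true
   println(isFairScheduler)  // still false; the pools section never appears
 }
 {code}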



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853
 ] 

Nicholas Chammas commented on SPARK-3333:
-

It looks like the default number of reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling for this variable. 

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 

[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117853#comment-14117853
 ] 

Nicholas Chammas edited comment on SPARK-3333 at 9/2/14 3:13 AM:
-

It looks like the default number of reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling for this variable. 

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are the runtimes for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?


was (Author: nchammas):
It looks like the default number of reducers does indeed explain most of the 
performance difference here. But there is still a significant difference even 
after controlling for this variable. 

I have 2 identical EC2 clusters as described in this JIRA issue, one on 1.0.2 
and one on 1.1.0-rc3. This time I ran the following PySpark code:

{code}
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, 
sc.defaultParallelism).take(1)
{code}

Here are my results for 3 runs on each cluster:
||1.0.2||1.1.0-rc3||
| 95s | 343s |
| 89s | 336s |
| 95s | 334s |

So manually setting the number of reducers to a smaller number does help a lot, 
but there is still a 3-4x performance slowdown.

Can anyone else replicate this result?

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 

[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager

2014-09-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117855#comment-14117855
 ] 

Reynold Xin commented on SPARK-3168:


[~tgraves] - can you take a look? I think you added the ui filter stuff. Thanks.

 The ServletContextHandler of webui lacks a SessionManager
 -

 Key: SPARK-3168
 URL: https://issues.apache.org/jira/browse/SPARK-3168
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: CAS
Reporter: meiyoula

 When I use CAS to implement single sign-on for the web UI, an exception occurs:
 {code}
 WARN  [qtp1076146544-24] / 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
 java.lang.IllegalStateException: No SessionManager
 at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
 at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
 at 
 org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:370)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
 at 
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
 at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
 at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:744)
 {code}
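
 A hedged sketch of one possible fix at the Jetty level (assuming the 
 contexts are built from org.eclipse.jetty.servlet.ServletContextHandler, as 
 the stack trace suggests):
 {code}
 import org.eclipse.jetty.servlet.ServletContextHandler
 import org.eclipse.jetty.server.session.SessionHandler

 // Create the context with session support so that request.getSession
 // (called by the CAS filters above) finds a SessionManager.
 val handler = new ServletContextHandler(ServletContextHandler.SESSIONS)
 // Equivalent for an existing context:
 // handler.setSessionHandler(new SessionHandler())
 {code}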



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization

2014-09-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3135.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Reynold Xin

 Avoid memory copy in TorrentBroadcast serialization
 ---

 Key: SPARK-3135
 URL: https://issues.apache.org/jira/browse/SPARK-3135
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin
  Labels: starter
 Fix For: 1.2.0


 TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize 
 the broadcast object into a single giant byte array, and then separates it 
 into smaller chunks.  We should implement a new OutputStream that writes 
 serialized bytes directly into chunks of byte arrays so we don't need the 
 extra memory copy.
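
 A hedged sketch of such a stream (illustrative only, not Spark's eventual 
 implementation):
 {code}
 import java.io.OutputStream
 import scala.collection.mutable.ArrayBuffer

 // Accumulates written bytes directly into fixed-size chunks, so no single
 // giant array (and no extra copy to split it) is ever materialized.
 class ChunkedOutputStream(chunkSize: Int) extends OutputStream {
   private val chunks = ArrayBuffer[Array[Byte]]()
   private var current = new Array[Byte](chunkSize)
   private var pos = 0

   override def write(b: Int): Unit = {
     if (pos == chunkSize) {   // current chunk is full: seal it, start a new one
       chunks += current
       current = new Array[Byte](chunkSize)
       pos = 0
     }
     current(pos) = b.toByte
     pos += 1
   }

   // All chunks written so far; the last one trimmed to the bytes actually used.
   def toChunks: Array[Array[Byte]] = (chunks :+ current.take(pos)).toArray
 }
 {code}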



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3333:
---
Target Version/s: 1.1.0

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
 java.nio.channels.ClosedChannelException
 at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
 at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
 at 
 org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
 at 
 

[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3333:
---
Priority: Blocker  (was: Major)

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least > 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  >>> a = sc.parallelize(["Nick", "John", "Bob"])
  >>> a = a.repartition(24000)
  >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
 java.nio.channels.ClosedChannelException
 at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
 at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
 at 
 org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
 at 
 

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117905#comment-14117905
 ] 

Josh Rosen commented on SPARK-3333:
---

Still investigating.  I tried this on my laptop by running the following script 
through spark-submit:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="test")
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(1)
parallelism = 4
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, parallelism).take(1)
{code}

With spark-1.0.2-bin-hadoop1, this ran in ~46 seconds, while branch-1.1 took 
~30 seconds.  Spark 1.0.2 seemed to experience one long pause that might have 
been due to GC, but I'll have to measure that.  Both ran to completion without 
crashing.  I'll see what happens if I bump up the number of partitions.
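
For comparison, here is a minimal Scala sketch of the same experiment 
(assuming the Spark 1.x RDD API in spark-shell; the values are illustrative). 
The explicit second argument to reduceByKey is what caps the reducer count:

{code}
// Minimal Scala sketch of the same experiment (Spark 1.x RDD API,
// run in spark-shell where `sc` is predefined). The explicit second
// argument to reduceByKey caps the number of reduce partitions,
// which is the O(numReducers) overhead under suspicion here.
val a = sc.parallelize(Seq("Nick", "John", "Bob")).repartition(1)
val parallelism = 4
a.keyBy(_.length)
  .reduceByKey(_ + _, parallelism) // 4 reducers instead of the default
  .take(1)
{code}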

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
 >>> a = sc.parallelize(["Nick", "John", "Bob"])
 >>> a = a.repartition(24000)
 >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 

[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117909#comment-14117909
 ] 

Patrick Wendell commented on SPARK-3005:
---

I think it might be safe to just define the semantics of cancelTask as best 
effort and have the default implementation be empty. From what I can tell, the 
DAGScheduler is already resilient to a task finishing after the stage has been 
cleaned up (for other reasons, I think it needs to be resilient to this). So 
maybe we should just remove the entire ableToCancelStages logic here and make 
it simpler. /cc [~kayousterhout] and [~markhamstra], who worked on this code.

Hey [~xuzhongxing], what do you mean when you say the tasks themselves already 
died and exited? The code here is designed to cancel outstanding tasks that 
are still running. For instance, say I have a job with 500 tasks running. Then 
one task fails multiple times, so I need to fail the stage, but many tasks are 
still running. Those tasks still need to be killed; otherwise you have zombie 
tasks running. I think in Mesos fine-grained mode we just won't be able to 
support this feature... but we should make it so it doesn't hang.
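
A rough sketch of what that could look like (illustrative only; the exact 
killTask signature in Spark may differ):

{code}
// Illustrative sketch of the "best effort" proposal, not actual Spark
// source: killTask gets an empty default body instead of throwing
// UnsupportedOperationException, so backends that cannot kill running
// tasks (e.g. Mesos fine-grained) simply no-op, and we rely on the
// DAGScheduler tolerating late completions from un-killed tasks.
trait SchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit

  // Best-effort cancellation: the default does nothing.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {}
}
{code}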


 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
 Cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some 
 failure cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program 
 throws an exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the Spark driver program 
 hangs.
 The Spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 

[jira] [Updated] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3005:
---
Target Version/s: 1.1.1, 1.2.0

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
 Cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some 
 failure cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program 
 throws an exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the Spark driver program 
 hangs.
 The Spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 

[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-01 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1239:
--
Summary: Don't fetch all map output statuses at each reducer during 
shuffles  (was: Don't fetch all map outputs at each reducer during shuffles)

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.1.0


 Instead, we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer, or we should just piggyback the statuses on each task. 
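
A hypothetical sketch of the first option (invented names; this is not 
Spark's actual MapOutputTracker API):

{code}
// Hypothetical sketch of the proposed narrowing (invented names, not
// Spark's real MapOutputTracker API). Each map status carries one
// output size per reducer, so the full table is O(numMappers *
// numReducers); a per-reducer fetch only needs O(numMappers).

// Stub standing in for Spark's internal MapStatus.
case class MapStatus(location: String, sizesByReducer: Array[Long])

trait MapOutputFetcher {
  // Current behavior: every reducer pulls all statuses for a shuffle.
  def getStatuses(shuffleId: Int): Array[MapStatus]

  // Proposed: fetch only the locations and sizes one reducer needs,
  // independent of how many other reducers exist.
  def getSizesForReducer(shuffleId: Int, reduceId: Int): Array[(String, Long)]
}
{code}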






[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117914#comment-14117914
 ] 

Patrick Wendell commented on SPARK-1239:
---

I think [~andrewor] has this in his backlog, but it's not actively being 
worked on. [~bcwalrus], do you or [~sandyr] want to take a crack at it?

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.1.0


 Instead, we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer, or we should just piggyback the statuses on each task. 






[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117917#comment-14117917
 ] 

Nicholas Chammas commented on SPARK-1701:
-

OK, so it sounds like the action is:
* Replace all occurrences of "slice" with "partition"
* Leave occurrences of "task" as-is

[~darabos] - Feel free to run with this.

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.


