[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212000#comment-14212000 ] Rui Li commented on SPARK-2321: --- Hi [~joshrosen], The new API is quite useful. But the information exposed is relatively limited at the moment. Do you have any plan to enhance it? For example, submission and completion time is not available in {{SparkStageInfo}}, while they're provided in {{StageInfo}}. Thanks! Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical Fix For: 1.2.0 This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
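To make the gap concrete: until SparkStageInfo grows those fields, stage submission and completion times are only reachable through the DeveloperApi listener that this ticket aims to stabilize. Below is a minimal sketch against the 1.2-era Scala API; the listener and variable names are illustrative, not part of any proposal.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Sketch: read the timing that StageInfo exposes but SparkStageInfo does not yet.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    // submissionTime and completionTime are Option[Long] millisecond timestamps.
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
}

// Usage, given an existing SparkContext `sc`:
//   sc.addSparkListener(new StageTimingListener)
{code}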
[jira] [Created] (SPARK-4398) Specialize rdd.parallelize for xrange
Xiangrui Meng created SPARK-4398: Summary: Specialize rdd.parallelize for xrange Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng sc.parallelize(range) is slow because it writes the data to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4398) Specialize rdd.parallelize for xrange
[ https://issues.apache.org/jira/browse/SPARK-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212014#comment-14212014 ] Apache Spark commented on SPARK-4398: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3264 Specialize rdd.parallelize for xrange - Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng sc.parallelize(range) is slow because it writes the data to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-689) Task will crash when setting SPARK_WORKER_CORES 128
[ https://issues.apache.org/jira/browse/SPARK-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207721#comment-14207721 ] Andrew Ash edited comment on SPARK-689 at 11/14/14 8:46 AM: I attempted a repro on a one-node cluster (my laptop) and confirmed that this bug no longer exists on master. A code inspection reveals that there is no longer a 128-thread limit on the Executor's threadpool referenced by this stacktrace line: {{at spark.executor.Executor.launchTask(Executor.scala:59)}} Here's the outline of my repro attempt: {noformat} aash@aash-mbp ~/git/spark$ cat conf/spark-env.sh SPARK_WORKER_CORES=200 SPARK_MASTER_IP=aash-mbp.local SPARK_PUBLIC_DNS=aash-mbp.local aash@aash-mbp ~/git/spark$ cat conf/spark-defaults.sh spark.master spark://aash-mbp.local:7077 aash@aash-mbp ~/git/spark$ sbin/start-all.sh ... aash@aash-mbp ~/git/spark$ bin/spark-shell spark sc.parallelize(1L to 10000L, 200).reduce(_+_) res0: Long = 50005000 spark {noformat} I'm now closing this ticket, but please reopen [~xiajunluan] if you're still having issues. was (Author: aash): I attempted a repro on a one-node cluster (my laptop) and confirmed that this bug no longer exists. A code inspection reveals that there is no longer a 128-thread limit on the Executor's threadpool referenced by this stacktrace line: {{at spark.executor.Executor.launchTask(Executor.scala:59)}} Here's the outline of my repro attempt: {noformat} aash@aash-mbp ~/git/spark$ cat conf/spark-env.sh SPARK_WORKER_CORES=200 SPARK_MASTER_IP=aash-mbp.local SPARK_PUBLIC_DNS=aash-mbp.local aash@aash-mbp ~/git/spark$ cat conf/spark-defaults.sh spark.master spark://aash-mbp.local:7077 aash@aash-mbp ~/git/spark$ sbin/start-all.sh ... aash@aash-mbp ~/git/spark$ bin/spark-shell spark sc.parallelize(1L to 10000L, 200).reduce(_+_) res0: Long = 50005000 spark {noformat} I'm now closing this ticket, but please reopen [~xiajunluan] if you're still having issues.
Task will crash when setting SPARK_WORKER_CORES 128 Key: SPARK-689 URL: https://issues.apache.org/jira/browse/SPARK-689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: xiajunluan when I set SPARK_WORKER_CORES 128(for example 200), and run a job in standalone mode that will allocate 200 tasks in one worker node, then task will crash(it seems that worker cores has been hard-code) {noformat} 13/02/07 11:25:02 ERROR StandaloneExecutorBackend: Task spark.executor.Executor$TaskRunner@5367839e rejected from java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, active threads = 128, queued tasks = 0, completed tasks = 0] java.util.concurrent.RejectedExecutionException: Task spark.executor.Executor$TaskRunner@5367839e rejected from java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, active threads = 128, queued tasks = 0, completed tasks = 0] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2013) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:816) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1337) at spark.executor.Executor.launchTask(Executor.scala:59) at spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:57) at spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:46) at akka.actor.Actor$class.apply(Actor.scala:318) at spark.executor.StandaloneExecutorBackend.apply(StandaloneExecutorBackend.scala:17) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 13/02/07 11:25:02 INFO StandaloneExecutorBackend: Connecting to master: akka://spark@10.0.2.19:60882/user/StandaloneScheduler 13/02/07 11:25:02 INFO StandaloneExecutorBackend: Got assigned task 1929 13/02/07 11:25:02 INFO Executor: launch taskId: 1929 13/02/07 11:25:02 ERROR StandaloneExecutorBackend: java.lang.NullPointerException at spark.executor.Executor.launchTask(Executor.scala:59) at
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212019#comment-14212019 ] Davies Liu commented on SPARK-4395: --- [~marmbrus] After removing the cache(), this script finished quickly, so I think there is something wrong with the caching. It also hung when I moved the .cache() before the parsing. While hanging, most of the CPU is spent in the JVM; the python process (only one) is idle. Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat +++ Code ran +++ import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error("Invalid logline: %s" % logline) ... return Row( ... UserID = int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating = int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat") ... # Call the parse_apace_log_line function on each line. ... .map(parse_ratings_line) ... # Caches the objects in memory since they will be queried multiple times. ... .cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable("RatingsTable") sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() (Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-755) Kryo serialization failing - MLbase
[ https://issues.apache.org/jira/browse/SPARK-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212033#comment-14212033 ] Andrew Ash commented on SPARK-755: -- Quick check [~sparks], this hasn't been updated in quite some time; is it still a problem for you? I think you may have been bit by SPARK-2878 which also deals with Kryo serialization failing and has since been fixed. Kryo serialization failing - MLbase --- Key: SPARK-755 URL: https://issues.apache.org/jira/browse/SPARK-755 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 0.8.0 Reporter: Evan Sparks When I turn on Kryo serialization, I get the following error as I increase the size of my input dataset. (From ~10GB to ~100GB). This issue does not manifest itself when I turn kryo off. I have code that successfully reads files, parses them into an {noformat}RDD[(String,Vector)]{noformat}, which can then be .count()'ed. I then run a .flatMap on these, with a function that has the following signature: {code} def expandData(x: (String, Vector)): Seq[(String, Float, Vector)] {code} And running a .count() on that RDD crashes - stack trace of failed task looks like this: {noformat} 13/05/31 00:16:53 INFO cluster.TaskSetManager: Finished TID 2024 in 23594 ms (progress: 10/1000) 13/05/31 00:16:53 INFO scheduler.DAGScheduler: Completed ResultTask(3, 24) 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:151 13/05/31 00:16:53 INFO cluster.TaskSetManager: Starting task 3.0:175 as TID 2161 on slave 14: ip-10-62-199-77.ec2.internal:40850 (NODE_LOCAL) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Serialized task 3.0:175 as 2832 bytes in 0 ms 13/05/31 00:16:53 INFO cluster.TaskSetManager: Lost TID 2053 (task 3.0:49) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Loss was due to com.esotericsoftware.kryo.KryoException com.esotericsoftware.kryo.KryoException: java.lang.ArrayIndexOutOfBoundsException Serialization trace: elements (org.mlbase.Vector) _3 (scala.Tuple3) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:504) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:571) at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:26) at spark.serializer.SerializationStream$class.writeAll(Serializer.scala:63) at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:21) at spark.storage.BlockManager.dataSerialize(BlockManager.scala:910) at spark.storage.MemoryStore.putValues(MemoryStore.scala:61) at spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:584) at spark.storage.BlockManager.put(BlockManager.scala:580) at spark.CacheManager.getOrCompute(CacheManager.scala:55) at spark.RDD.iterator(RDD.scala:207) at spark.scheduler.ResultTask.run(ResultTask.scala:84) at spark.executor.Executor$TaskRunner.run(Executor.scala:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:151 13/05/31 
00:16:53 INFO cluster.TaskSetManager: Starting task 3.0:49 as TID 2162 on slave 12: ip-10-11-46-255.ec2.internal:38878 (NODE_LOCAL) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Serialized task 3.0:49 as 2832 bytes in 0 ms 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:152 13/05/31 00:16:54 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_7_257 in mem {noformat} My Kryo Registrator looks like this: {code} class MyRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.register(classOf[Vector]) kryo.register(classOf[String]) kryo.register(classOf[Float]) kryo.register(classOf[Tuple3[String,Float,Vector]]) kryo.register(classOf[Seq[Tuple3[String,Float,Vector]]]) kryo.register(classOf[Map[String,Vector]]) } } {code} Vector in this case is an org.mlbase.Vector, which in this case is a slightly modified version of spark.util.Vector (uses floats instead of Doubles). -- This message
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212043#comment-14212043 ] Apache Spark commented on SPARK-2352: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3222 [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-664) Accumulator updates should get locally merged before sent to the driver
[ https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212046#comment-14212046 ] Andrew Ash commented on SPARK-664: -- [~irashid] it sounds like your proposal is to batch accumulator updates between tasks on the executor before sending them back to the driver? I agree this would reduce the amount of network traffic, but the batching would come at a cost of higher latency between task completion and accumulator update landing in the accumulator in the driver. With the completion of SPARK-2380 these accumulators are now shown in the UI, so increasing latency would have an effect on end users. If network bandwidth and UI update latency are fundamentally at odds, maybe this is a case for a user option to choose to optimize for network or UI, something like {{spark.accumulators.mergeUpdatesOnExecutor}} defaulted to false. cc [~pwendell] for thoughts Accumulator updates should get locally merged before sent to the driver --- Key: SPARK-664 URL: https://issues.apache.org/jira/browse/SPARK-664 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Priority: Minor Whenever a task finishes, the accumulator updates from that task are immediately sent back to the driver. When the accumulator updates are big, this is inefficient because (a) a lot more data has to be sent to the driver and (b) the driver has to do all the work of merging the updates together. Probably doesn't matter for small accumulators / low number of tasks, but if both are big, this could be a big bottleneck. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
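For reference, the update path under discussion is the one behind ordinary named accumulators. A minimal sketch follows (assuming a running cluster; the input path is a placeholder, and {{spark.accumulators.mergeUpdatesOnExecutor}} above is only a proposed name, not an existing setting):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-update-sketch"))

    // A named accumulator is shown in the web UI (SPARK-2380), so the latency
    // of each task's update reaching the driver is visible to end users.
    val emptyLines = sc.accumulator(0L, "empty lines")

    sc.textFile("hdfs:///placeholder/input").foreach { line =>
      // Today each task ships its local updates to the driver as soon as it
      // finishes; batching them on the executor would cut network traffic at
      // the cost of UI freshness.
      if (line.isEmpty) emptyLines += 1L
    }

    println(s"empty lines: ${emptyLines.value}")
    sc.stop()
  }
}
{code}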
[jira] [Commented] (SPARK-748) Add documentation page describing interoperability with other software (e.g. HBase, JDBC, Kafka, etc.)
[ https://issues.apache.org/jira/browse/SPARK-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212049#comment-14212049 ] Andrew Ash commented on SPARK-748: -- I agree this would be valuable -- almost like a Spark Cookbook of how to read and write data from various other systems. Step one is probably deciding what software to mention. Tentatively I propose: Spark Core - HDFS - HBase - Cassandra - Elasticsearch - JDBC, with examples for Postgres and MySQL - General Hadoop InputFormat Spark Streaming - Kafka - Flume - Storm For destination, this could go on the documentation included in the git repo and published to the Spark website, or on the Spark project wiki. I tend to prefer the former. A possible location for that could be http://spark.apache.org/docs/latest/programming-guide.html#external-datasets Add documentation page describing interoperability with other software (e.g. HBase, JDBC, Kafka, etc.) -- Key: SPARK-748 URL: https://issues.apache.org/jira/browse/SPARK-748 Project: Spark Issue Type: New Feature Components: Documentation Reporter: Josh Rosen Spark seems to be gaining a lot of data input / output features for integrating with systems like HBase, Kafka, JDBC, Hadoop, etc. It might be a good idea to create a single documentation page that provides a list of all of the data sources that Spark supports and links to the relevant documentation / examples / {{spark-users}} threads. This would help prospective users to evaluate how easy it will be to integrate Spark with their existing systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
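As a flavor of what one cookbook entry could look like, here is a hedged JDBC sketch built on the existing {{org.apache.spark.rdd.JdbcRDD}}; the connection URL, credentials, and query are placeholders that a real page would flesh out per database:
{code}
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcCookbookSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-cookbook-sketch"))

    // Placeholder connection details; a real entry would show Postgres and
    // MySQL variants as proposed above.
    val url = "jdbc:postgresql://dbhost:5432/mydb"

    // JdbcRDD splits the query on the two '?' bound parameters.
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 10,
      mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

    println(s"loaded ${rows.count()} rows over JDBC")
    sc.stop()
  }
}
{code}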
[jira] [Commented] (SPARK-625) Client hangs when connecting to standalone cluster using wrong address
[ https://issues.apache.org/jira/browse/SPARK-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212062#comment-14212062 ] Andrew Ash commented on SPARK-625: -- Spark is very sensitive to hostnames in Spark URLs, and that comes from Akka being very sensitive. I've personally been bitten by hostnames vs FQDNs vs external IP address vs loopback IP address, and it's really a pain. On current master branch (1.2) with the Spark standalone master listening on {{spark://aash-mbp.local:7077}} as confirmed by the master web UI, and the spark shell attempting to connect to {{spark://127.0.01:7077}} with the {{--master}} parameter, the driver tries 3 attempts and then fails with this message: {noformat} 14/11/14 01:37:56 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:37:56 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:37:56 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:16 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:38:16 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:16 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:38:36 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:38:36 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:36 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:38:56 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up. 14/11/14 01:38:56 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet. 14/11/14 01:38:56 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up. {noformat} So the hang seems to be gone and replaced with a reasonable 3x attempts and fail. [~joshrosen], short of changing Akka ourselves to make it less strict on exact URL matches, is there anything else we can do for this ticket? I think we can reasonably close as fixed. Client hangs when connecting to standalone cluster using wrong address -- Key: SPARK-625 URL: https://issues.apache.org/jira/browse/SPARK-625 Project: Spark Issue Type: Bug Affects Versions: 0.7.0, 0.7.1, 0.8.0 Reporter: Josh Rosen Priority: Minor I launched a standalone cluster on my laptop, connecting the workers to the master using my machine's public IP address (128.32.*.*:7077). 
If I try to connect spark-shell to the master using spark://0.0.0.0:7077, it successfully brings up a Scala prompt but hangs when I try to run a job. From the standalone master's log, it looks like the client's messages are being dropped without the client discovering that the connection has failed: {code} 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message RegisterJob(JobDescription(Spark shell)) for non-local recipient akka://spark@0.0.0.0:7077/user/Master at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message DaemonMsgWatch(Actor[akka://spark@128.32.*.*:57518/user/$a],Actor[akka://spark@0.0.0.0:7077/user/Master]) for non-local recipient akka://spark@0.0.0.0:7077/remote at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212064#comment-14212064 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, I sent an email to you about the shuffle data performance test. Looking forward to your reply. Thanks. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
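To make the zero-copy point concrete, here is a small standalone java.nio sketch (not Spark's actual network module) of {{FileChannel.transferTo}} pushing file bytes to a socket without passing them through user space:
{code}
import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

object ZeroCopySendSketch {
  def main(args: Array[String]): Unit = {
    val Array(path, host, port) = args  // e.g. a shuffle file and a peer host/port

    val fileChannel = new FileInputStream(path).getChannel
    val socket = SocketChannel.open(new InetSocketAddress(host, port.toInt))

    // transferTo lets the kernel move bytes from the page cache to the NIC,
    // avoiding the extra user-space copy described above. It may send fewer
    // bytes than requested, hence the loop.
    var position = 0L
    val size = fileChannel.size()
    while (position < size) {
      position += fileChannel.transferTo(position, size - position, socket)
    }

    fileChannel.close()
    socket.close()
  }
}
{code}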
[jira] [Commented] (SPARK-809) Give newly registered apps a set of executors right away
[ https://issues.apache.org/jira/browse/SPARK-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212068#comment-14212068 ] Andrew Ash commented on SPARK-809: -- I believe this situation hasn't changed? Looking a little at the master branch (1.2) the standalone cluster still gets its {{defaultParallelism}} in {{CoarseGrainedSchedulerBackend}} from the {{totalCoreCount}} that is incremented/decremented as executors join/leave the driver's awareness. Give newly registered apps a set of executors right away Key: SPARK-809 URL: https://issues.apache.org/jira/browse/SPARK-809 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Matei Zaharia Priority: Minor Right now, newly connected apps in the standalone cluster will not set a good defaultParallelism value if they create RDDs right after creating a SparkContext, because the executorAdded calls are asynchronous and happen after. It would be nice to wait for a few such calls before returning from the scheduler initializer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-835) RDD$parallelize() should use object serializer (not closure serializer) for collection objects
[ https://issues.apache.org/jira/browse/SPARK-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-835: - Fix Version/s: 0.8.0 RDD$parallelize() should use object serializer (not closure serializer) for collection objects -- Key: SPARK-835 URL: https://issues.apache.org/jira/browse/SPARK-835 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.7.3 Reporter: Dmitriy Lyubimov Fix For: 0.8.0 This is a twin issue for SPARK-826 encapsulating all use cases where collection of objects is transferred from front end to backend RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-835) RDD$parallelize() should use object serializer (not closure serializer) for collection objects
[ https://issues.apache.org/jira/browse/SPARK-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash resolved SPARK-835. -- Resolution: Fixed Closing per [~dlyubimov] with the Fix Version from SPARK-826 RDD$parallelize() should use object serializer (not closure serializer) for collection objects -- Key: SPARK-835 URL: https://issues.apache.org/jira/browse/SPARK-835 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.7.3 Reporter: Dmitriy Lyubimov Fix For: 0.8.0 This is a twin issue for SPARK-826 encapsulating all use cases where collection of objects is transferred from front end to backend RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-904) Not able to Start/Stop Spark Worker from Remote Machine
[ https://issues.apache.org/jira/browse/SPARK-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212078#comment-14212078 ] Andrew Ash commented on SPARK-904: -- [~ayushmishra2005] I suspect you don't have Spark installed on the remote machine -- the {{start-all.sh}} script won't install it for you on remote machines. If you're still having trouble, please reach out to the spark users list from http://spark.apache.org/community.html which is a better place for these kinds of requests anyway. I'm closing this issue for now but let me know here if you aren't able to get a resolution on the mailing lists. Thanks, and good luck with Spark! Andrew Not able to Start/Stop Spark Worker from Remote Machine --- Key: SPARK-904 URL: https://issues.apache.org/jira/browse/SPARK-904 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Ayush I have two machines A and B. I am trying to run Spark Master on machine A and Spark Worker on machine B. I have set machine B'host name in conf/slaves in my Spark directory. When I am executing start-all.sh to start master and workers, I am getting below message on console: abc@abc-vostro:~/spark-scala-2.10$ sudo sh bin/start-all.sh sudo: /etc/sudoers.d is world writable starting spark.deploy.master.Master, logging to /home/abc/spark-scala-2.10/bin/../logs/spark-root-spark.deploy.master.Master-1-abc-vostro.out 13/09/11 14:54:29 WARN spark.Utils: Your hostname, abc-vostro resolves to a loopback address: 127.0.1.1; using 1XY.1XY.Y.Y instead (on interface wlan2) 13/09/11 14:54:29 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address Master IP: abc-vostro cd /home/abc/spark-scala-2.10/bin/.. ; /home/abc/spark-scala-2.10/bin/start-slave.sh 1 spark://abc-vostro:7077 xyz@1XX.1XX.X.X's password: xyz@1XX.1XX.X.X: bash: line 0: cd: /home/abc/spark-scala-2.10/bin/..: No such file or directory xyz@1XX.1XX.X.X: bash: /home/abc/spark-scala-2.10/bin/start-slave.sh: No such file or directory Master is started but worker is failed to start. I have set xyz@1XX.1XX.X.X in conf/slaves in my Spark directory. Can anyone help me to resolve this? This is probably something I'm missing any configuration on my end. However When I create Spark Master and Worker on same machine, It is working fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-904) Not able to Start/Stop Spark Worker from Remote Machine
[ https://issues.apache.org/jira/browse/SPARK-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-904. Resolution: Not a Problem Not able to Start/Stop Spark Worker from Remote Machine --- Key: SPARK-904 URL: https://issues.apache.org/jira/browse/SPARK-904 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Ayush I have two machines A and B. I am trying to run Spark Master on machine A and Spark Worker on machine B. I have set machine B'host name in conf/slaves in my Spark directory. When I am executing start-all.sh to start master and workers, I am getting below message on console: abc@abc-vostro:~/spark-scala-2.10$ sudo sh bin/start-all.sh sudo: /etc/sudoers.d is world writable starting spark.deploy.master.Master, logging to /home/abc/spark-scala-2.10/bin/../logs/spark-root-spark.deploy.master.Master-1-abc-vostro.out 13/09/11 14:54:29 WARN spark.Utils: Your hostname, abc-vostro resolves to a loopback address: 127.0.1.1; using 1XY.1XY.Y.Y instead (on interface wlan2) 13/09/11 14:54:29 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address Master IP: abc-vostro cd /home/abc/spark-scala-2.10/bin/.. ; /home/abc/spark-scala-2.10/bin/start-slave.sh 1 spark://abc-vostro:7077 xyz@1XX.1XX.X.X's password: xyz@1XX.1XX.X.X: bash: line 0: cd: /home/abc/spark-scala-2.10/bin/..: No such file or directory xyz@1XX.1XX.X.X: bash: /home/abc/spark-scala-2.10/bin/start-slave.sh: No such file or directory Master is started but worker is failed to start. I have set xyz@1XX.1XX.X.X in conf/slaves in my Spark directory. Can anyone help me to resolve this? This is probably something I'm missing any configuration on my end. However When I create Spark Master and Worker on same machine, It is working fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-957) The problem that repeated computation among iterations
[ https://issues.apache.org/jira/browse/SPARK-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212084#comment-14212084 ] Andrew Ash commented on SPARK-957: -- Hi [~caizhua], are you still having issues with your implementation of the LDA algorithm? We try to only keep tickets open that have remaining work to be done. You can also reference SPARK-1405 for the LDA implementation being worked on for future inclusion into MLlib. The problem that repeated computation among iterations -- Key: SPARK-957 URL: https://issues.apache.org/jira/browse/SPARK-957 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 0.7.3 Reporter: caizhua For the LDA model, if we make each document a single record of the RDD, it is quite slow, so we instead make the RDD a set of blocks, where each block holds a subset of documents. However, when we run the program, we find that a lot of computation is repeated across iterations. Basically, when we come to the ith iteration, all the jobs that happened in iterations 0 to (i-1) are repeated. Likewise, the jobs in the ith iteration are repeated in the (i+1)th iteration, so with m iterations the jobs of each iteration end up repeated in every later one. However, the result is still correct. :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
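For anyone hitting the same behavior, the usual fix for iterative jobs is to persist each iteration's RDD so that iteration i+1 does not replay the lineage of iterations 0..i. A generic sketch follows (the update function and storage level are placeholders, not the reporter's actual LDA code):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IterativeCachingSketch {
  // Runs `step` numIterations times, materializing each result once and
  // releasing the previous iteration so earlier jobs are not recomputed.
  def iterate[T](initial: RDD[T], numIterations: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
    var current = initial.persist(StorageLevel.MEMORY_AND_DISK)
    current.count()  // force materialization of the starting point

    for (_ <- 0 until numIterations) {
      val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
      next.count()          // materialize this iteration exactly once
      current.unpersist()   // drop the previous iteration's blocks
      current = next
      // For very long lineages, next.checkpoint() would also truncate the
      // dependency graph so a failure does not recompute from iteration 0.
    }
    current
  }
}
{code}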
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212088#comment-14212088 ] Andrew Ash commented on SPARK-794: -- I don't see a {{ClusterScheduler}} class on master -- was that refactored to {{TaskSchedulerImpl}}? There is still a {{Thread.sleep(1000)}} in that class's {{stop()}} though, which may be what this ticket refers to: https://github.com/apache/spark/blob/1df05a40ebf3493b0aff46d18c0f30d2d5256c7b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L399 Remove sleep() in ClusterScheduler.stop --- Key: SPARK-794 URL: https://issues.apache.org/jira/browse/SPARK-794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Matei Zaharia This temporary change made a while back slows down the unit tests quite a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212094#comment-14212094 ] Andrew Ash commented on SPARK-665: -- This role of creating RPM packages seems to have been taken up by the various Hadoop distributors for their distributions (Cloudera, MapR, HortonWorks, etc). Does the Apache Spark team still intend to create RPMs for Spark? An obvious subsequent request would be to also release in the DEB format. I don't think this is a route we want to go down now but wanted to hear others' thoughts. Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1206) Add python support for average and other summary statistics
[ https://issues.apache.org/jira/browse/SPARK-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1206. - Resolution: Implemented Closing per [~holdenk_amp] as already done. Add python support for average and other summary statistics --- Key: SPARK-1206 URL: https://issues.apache.org/jira/browse/SPARK-1206 Project: Spark Issue Type: Improvement Reporter: Holden Karau Priority: Minor We have a number of summary statistics in DoubleRDDFunctions.scala. We should implement these in python too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212104#comment-14212104 ] Andrew Ash commented on SPARK-665: -- Sean are you suggesting dropping the .deb packages that Apache Spark releases as a simplification effort? It feels inequitable to support one packaging format (deb) but not the other (rpm). Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-818) Design Spark Job Server
[ https://issues.apache.org/jira/browse/SPARK-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-818. Resolution: Won't Fix The community consensus was that the Spark job server should live in a separate repository from the Apache Spark code. It now lives in the repository that [~velvia] links above, and has continued to be quite active over the past 6 months. Closing this Apache Spark issue as Won't Fix -- the job server is a welcome addition to the growing Spark ecosystem but its issues don't belong in this issue tracker. Many thanks everyone! Design Spark Job Server --- Key: SPARK-818 URL: https://issues.apache.org/jira/browse/SPARK-818 Project: Spark Issue Type: New Feature Reporter: Evan Chan Attachments: job-server-architecture.md We've been developing a generic REST job server here at Ooyala and would like to outline its goals, architecture, and API, so we could get feedback from the community and hopefully contribute it back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212109#comment-14212109 ] Sean Owen commented on SPARK-665: - Not suggesting that, no. I suppose it does depend on demand. Apparently there was enough of a need for a .deb package that it was contributed and maintained, but perhaps not for .rpm, because there are other sources? If there were demand it could be worth the cost. Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1231) DEAD worker should recover automatically
[ https://issues.apache.org/jira/browse/SPARK-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1231. - Resolution: Duplicate DEAD worker should recover automatically -- Key: SPARK-1231 URL: https://issues.apache.org/jira/browse/SPARK-1231 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Reporter: Yi Tian Labels: dead, recover, worker The master should send a response when a DEAD worker sends a heartbeat, so the worker can clean up all applications and drivers and re-register itself with the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1231) DEAD worker should recover automaticly
[ https://issues.apache.org/jira/browse/SPARK-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212110#comment-14212110 ] Andrew Ash commented on SPARK-1231: --- Sorry [~tianyi], when I did my search for prior tickets in SPARK-3736 I didn't find this one. The issue of workers not recovering when disconnected has since been resolved though and will be released as part of Spark 1.2.0, so I'm closing this ticket as a duplicate. Please let us know if you have any troubles with that implementation once you start testing it. Thanks for reporting the bug! Andrew DEAD worker should recover automatically -- Key: SPARK-1231 URL: https://issues.apache.org/jira/browse/SPARK-1231 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Reporter: Yi Tian Labels: dead, recover, worker The master should send a response when a DEAD worker sends a heartbeat, so the worker can clean up all applications and drivers and re-register itself with the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1169) Add countApproxDistinct and countApproxDistinctByKey to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212112#comment-14212112 ] Andrew Ash commented on SPARK-1169: --- On current master (1.2) I see that rdd.py now has a countApproxDistinct() method, but I don't see one for countApproxDistinctByKey() so this ticket is half completed. Add countApproxDistinct and countApproxDistinctByKey to PySpark --- Key: SPARK-1169 URL: https://issues.apache.org/jira/browse/SPARK-1169 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Matei Zaharia Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
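For reference, both methods already exist on the Scala side; a short sketch of what the missing PySpark half would mirror (the sample data is illustrative):
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object ApproxDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("approx-distinct-sketch"))

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("b", 1)))

    // Counterpart of the countApproxDistinct() already ported to rdd.py.
    val approxTotal = pairs.values.countApproxDistinct(relativeSD = 0.05)

    // The per-key variant from PairRDDFunctions is the half still missing
    // from PySpark.
    val approxPerKey = pairs.countApproxDistinctByKey(relativeSD = 0.05).collect()

    println(s"approx distinct values overall: $approxTotal")
    approxPerKey.foreach { case (k, n) => println(s"key $k has ~$n distinct values") }
    sc.stop()
  }
}
{code}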
[jira] [Commented] (SPARK-754) Multiple Spark Contexts active in a single Spark Context
[ https://issues.apache.org/jira/browse/SPARK-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212113#comment-14212113 ] Andrew Ash commented on SPARK-754: -- This is actually currently unsupported, and a ticket to make this possible is being tracked at SPARK-2243 I'm going to close this ticket as a duplicate of that one, but please let me know if you feel there are subtleties here that keep these from being a duplicate. Thanks for the bug report Erik! Andrew Multiple Spark Contexts active in a single Spark Context Key: SPARK-754 URL: https://issues.apache.org/jira/browse/SPARK-754 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0 Reporter: Erik James Freed Priority: Critical This may be no more than creating a unit test to ensure it can be done but it is not clear that one can instantiate multiple spark contexts within a single VM and use them concurrently (one thread in a context at a time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-754) Multiple Spark Contexts active in a single Spark Context
[ https://issues.apache.org/jira/browse/SPARK-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-754. Resolution: Duplicate Multiple Spark Contexts active in a single Spark Context Key: SPARK-754 URL: https://issues.apache.org/jira/browse/SPARK-754 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0 Reporter: Erik James Freed Priority: Critical This may be no more than creating a unit test to ensure it can be done but it is not clear that one can instantiate multiple spark contexts within a single VM and use them concurrently (one thread in a context at a time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2243: -- Affects Version/s: 0.7.0 Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: 
http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Created] (SPARK-4399) Support multiple cloud providers
Andrew Ash created SPARK-4399: - Summary: Support multiple cloud providers Key: SPARK-4399 URL: https://issues.apache.org/jira/browse/SPARK-4399 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.0 Reporter: Andrew Ash We currently have Spark startup scripts for Amazon EC2 but not for various other cloud providers. This ticket is an umbrella to support multiple cloud providers in the bundled scripts, not just Amazon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4400) Add scripts for launching Spark on Google Compute Engine (GCE)
Andrew Ash created SPARK-4400: - Summary: Add scripts for launching Spark on Google Compute Engine (GCE) Key: SPARK-4400 URL: https://issues.apache.org/jira/browse/SPARK-4400 Project: Spark Issue Type: New Feature Reporter: Andrew Ash -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2398. -- Resolution: Duplicate The PR that resolved that was ultimately tied to a new JIRA, SPARK-3768. Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.master spark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction 0.6 spark.shuffle.memoryFraction 0.3 spark.shuffle.compress true spark.shuffle.spill.compress true spark.broadcast.compress true spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores 16 spark.locality.wait 6000 spark.executor.instances 5 UI shows ExecutorLostFailure.
Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) --- 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to
[jira] [Commented] (SPARK-1358) Continuous integrated test should be involved in Spark ecosystem
[ https://issues.apache.org/jira/browse/SPARK-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212125#comment-14212125 ] Andrew Ash commented on SPARK-1358: --- I've heard these sorts of extended tests called end-to-end tests. A sample one for Spark could be to stand up an HDFS cluster, load some data into it, stand up a parallel Spark cluster, read data out of HDFS, and compute some kind of aggregate. They tend to require significant hardware to run, though, because they are much more intensive and can be long-running. [~pwendell] is that kind of extended hardware available? I know we currently run some CI on Amplab Jenkins but I'm uncertain how much additional capacity it could support. cc [~shaneknapp] Continuous integrated test should be involved in Spark ecosystem - Key: SPARK-1358 URL: https://issues.apache.org/jira/browse/SPARK-1358 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: xiajunluan Currently, Spark only contains unit tests and performance tests, but that is not enough for customers to evaluate the status of their cluster and the Spark version they will use. It is necessary to build continuous integration tests for Spark development. They could include: 1. complex application test cases for Spark/Spark Streaming/GraphX 2. stress test cases 3. fault tolerance test cases 4. ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
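As a rough illustration of the end-to-end test described in the comment above, here is a minimal Scala sketch. The master URL, HDFS path, dataset, and expected sum are all placeholders invented for illustration; this is not an existing Spark test.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal end-to-end check: read a known dataset out of HDFS, compute an
// aggregate, and fail loudly if the result does not match the expected value.
object EndToEndSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EndToEndSmokeTest")
      .setMaster("spark://localhost:7077")           // assumed standalone master
    val sc = new SparkContext(conf)

    // Assumed location of a file previously loaded into the test HDFS cluster,
    // containing one integer per line.
    val data = sc.textFile("hdfs://localhost:8020/e2e/input/numbers.txt")
    val sum = data.map(_.trim.toLong).reduce(_ + _)

    val expected = 5000050000L                        // made-up expected aggregate
    require(sum == expected, s"End-to-end check failed: got $sum, expected $expected")

    sc.stop()
  }
}
{code}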
[jira] [Closed] (SPARK-4400) Add scripts for launching Spark on Google Compute Engine (GCE)
[ https://issues.apache.org/jira/browse/SPARK-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-4400. - Resolution: Duplicate Add scripts for launching Spark on Google Compute Engine (GCE) -- Key: SPARK-4400 URL: https://issues.apache.org/jira/browse/SPARK-4400 Project: Spark Issue Type: New Feature Reporter: Andrew Ash -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212132#comment-14212132 ] Andrew Ash commented on SPARK-928: -- Latest Chill (0.5.0) is still using Kryo 2.21 so this is still waiting on a Chill update https://github.com/twitter/chill/blob/0.5.0/project/Build.scala#L13 Add support for Unsafe-based serializer in Kryo 2.22 Key: SPARK-928 URL: https://issues.apache.org/jira/browse/SPARK-928 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Fix For: 1.0.0 This can reportedly be quite a bit faster, but it also requires Chill to update its Kryo dependency. Once that happens we should add a spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
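For concreteness, here is a hedged sketch of what a {{spark.kryo.useUnsafe}} flag might look like once Chill picks up Kryo 2.22. The flag name comes from the issue description above, but the wiring shown (reading the flag from SparkConf and choosing Kryo's Unsafe-based streams) is only an assumption, not the actual Spark implementation.

{code}
import com.esotericsoftware.kryo.io.{Input, Output, UnsafeInput, UnsafeOutput}
import java.io.{InputStream, OutputStream}
import org.apache.spark.SparkConf

// Hypothetical wiring: pick Unsafe-based Kryo streams when the proposed
// spark.kryo.useUnsafe flag is set (requires Kryo >= 2.22 via Chill).
class KryoStreamFactory(conf: SparkConf) {
  private val useUnsafe = conf.getBoolean("spark.kryo.useUnsafe", false)

  def newOutput(stream: OutputStream): Output =
    if (useUnsafe) new UnsafeOutput(stream) else new Output(stream)

  def newInput(stream: InputStream): Input =
    if (useUnsafe) new UnsafeInput(stream) else new Input(stream)
}
{code}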
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212145#comment-14212145 ] Andrew Ash commented on SPARK-1568: --- [~sams] did you see an improvement when you upgraded to 1.0.0? I've noticed myself that reading from s3 can be slow, particularly in the scenario where there are many small files. I think the hadoop S3 adapter makes many more API calls than is necessary, and that scales with the number of files you have. Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
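To make the small-files point concrete, here is a hedged sketch of the usual workaround: read the many small S3 objects with a glob and immediately coalesce to a smaller number of partitions so downstream stages are not dominated by per-file overhead. The bucket name, path, and partition count are placeholders.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: reading many small S3 objects, then coalescing so the rest of the
// job does not carry one tiny partition per file.
val sc = new SparkContext(new SparkConf().setAppName("S3SmallFiles"))
val raw = sc.textFile("s3n://my-bucket/logs/2014/11/*")   // placeholder bucket/path
val compacted = raw.coalesce(64)                          // placeholder partition count
println(s"line count: ${compacted.count()}")
{code}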
[jira] [Created] (SPARK-4401) RuleExecutor correctly logs trace iteration num
YanTang Zhai created SPARK-4401: --- Summary: RuleExecutor correctly logs trace iteration num Key: SPARK-4401 URL: https://issues.apache.org/jira/browse/SPARK-4401 Project: Spark Issue Type: Bug Components: SQL Reporter: YanTang Zhai RuleExecutor logs the wrong iteration number at trace level, as follows: if (curPlan.fastEquals(lastPlan)) { logTrace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.") continue = false } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
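To show why the reported number is off, here is a self-contained, simplified sketch of a fixed-point loop of this shape (not the actual RuleExecutor code): the counter is incremented after each pass, so by the time the plan is found unchanged, {{iteration}} is one greater than the number of passes that have run, and the message should log {{iteration - 1}}.

{code}
// Self-contained illustration of the off-by-one: the counter is bumped after a
// pass, so when the fixed point is detected, `iteration` is one too high.
object FixedPointDemo extends App {
  def applyAllRules(plan: Int): Int = math.min(plan + 1, 3) // toy "rule batch": converges at 3

  var curPlan = 0
  var lastPlan = -1
  var iteration = 1
  var continue = true
  while (continue) {
    curPlan = applyAllRules(curPlan)
    iteration += 1                    // now points at the *next* pass
    if (curPlan == lastPlan) {
      // Correct count is (iteration - 1); logging `iteration` over-reports by one.
      println(s"Fixed point reached after ${iteration - 1} iterations.")
      continue = false
    }
    lastPlan = curPlan
  }
}
{code}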
[jira] [Created] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
Vijay created SPARK-4402: Summary: Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception; all the processing completed up to that point is lost, which results in resource wastage (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, the runtime exception could be avoided. I believe a similar validation/feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path only at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
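The request is essentially to fail fast on an existing output path before any work is scheduled. Below is a hedged sketch of such a pre-flight check using the Hadoop FileSystem API; the helper name and the way it is called before the job are illustrative only, not the validation Spark (or SPARK-1100) actually implements.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative pre-flight check: verify the output path is free *before*
// triggering the action, instead of failing after the whole job has run.
object OutputPathCheck {
  def requireAbsent(pathStr: String, hadoopConf: Configuration = new Configuration()): Unit = {
    val path = new Path(pathStr)
    val fs = FileSystem.get(path.toUri, hadoopConf)
    require(!fs.exists(path), s"Output path $path already exists; refusing to start the job.")
  }
}

// Usage (hypothetical): OutputPathCheck.requireAbsent("hdfs:///tmp/out")
//                       schemaRdd.saveAsTextFile("hdfs:///tmp/out")
{code}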
[jira] [Resolved] (SPARK-3722) Spark on yarn docs work
[ https://issues.apache.org/jira/browse/SPARK-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3722. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: WangTaoTheTonic Target Version/s: 1.3.0 Spark on yarn docs work --- Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 Adds another way to obtain containers' logs. Fixes an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-4403: Attachment: ipython_out Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1568. -- Resolution: Fixed Fix Version/s: 1.0.0 Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam Fix For: 1.0.0 I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212376#comment-14212376 ] Sean Owen commented on SPARK-4402: -- Is this not the same issue resolved by https://issues.apache.org/jira/browse/SPARK-1100 ? I think the behavior implemented there is the intended behavior here. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation is happening at the time of statement execution as a part of lazyevolution of action statement. But if the path already exists then it throws a runtime exception. Hence all the processing completed till that point is lost which results in resource wastage (processing time and CPU usage). If this I/O related validation is done before the RDD action operations then this runtime exception can be avoided. I believe similar validation/ feature is implemented in hadoop also. Example: SchemaRDD.saveAsTextFile() evaluated the path during runtime -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212412#comment-14212412 ] Apache Spark commented on SPARK-4404: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3266 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-4404: --- Description: When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. was: We we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
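One common way to get the behavior described here is for the wrapper process to register a JVM shutdown hook that destroys its child when the wrapper itself is killed. The sketch below shows that general pattern only; it is not the change made in the linked pull request, and the child command line is a placeholder.

{code}
import scala.sys.process._

// General pattern: a launcher that forwards termination to its sub-process.
object LauncherWithShutdownHook extends App {
  // Placeholder child command; the real bootstrapper launches SparkSubmit.
  val child: Process = Seq("sleep", "600").run()

  // If this launcher is killed (e.g. SIGTERM), take the child down with us.
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = child.destroy()
  })

  // Otherwise just mirror the child's exit code.
  sys.exit(child.exitValue())
}
{code}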
[jira] [Commented] (SPARK-4354) 14/11/12 09:39:00 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, HYD-RNDNW-VFRCO-RCORE2): java.lang.NoClassDefFoundError: Could not initialize class org.xerial
[ https://issues.apache.org/jira/browse/SPARK-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212482#comment-14212482 ] Apache Spark commented on SPARK-4354: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3211 14/11/12 09:39:00 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, HYD-RNDNW-VFRCO-RCORE2): java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy -- Key: SPARK-4354 URL: https://issues.apache.org/jira/browse/SPARK-4354 Project: Spark Issue Type: Question Components: Examples Affects Versions: 1.1.0 Environment: Linux Reporter: Shyam Labels: newbie Attachments: client-exception.txt Prebuilt Spark for Hadoop 2.4 installed in 4 redhat linux machines Standalone cluster mode. Machine 1(Master) Machine 2(Worker node 1) Machine 3(Worker node 2) Machine 4(Client for executing spark examples) I ran below mentioned command in Machine 4 then got exception mentioned in the summary of this issue. sh spark-submit --class org.apache.spark.examples.SparkPi --jars /FS/lib/spark-assembly-1.1.0-hadoop2.4.0.jar --master spark://MasterIP:7077 --deploy-mode client /FS/lib/spark-examples-1.1.0-hadoop2.4.0.jar 10 java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4338) Remove yarn-alpha support
[ https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212485#comment-14212485 ] Andrew Ash commented on SPARK-4338: --- From discussion on the dev list today, Sandy aims to include an example of building against Hadoop 2.5 in the docs with this PR Remove yarn-alpha support - Key: SPARK-4338 URL: https://issues.apache.org/jira/browse/SPARK-4338 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4405) Matrices.* construction methods should check for rows x cols overflow
Joseph K. Bradley created SPARK-4405: Summary: Matrices.* construction methods should check for rows x cols overflow Key: SPARK-4405 URL: https://issues.apache.org/jira/browse/SPARK-4405 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Matrices has several methods which construct new all-0, all-1, or random matrices. They take numRows, numCols as Int and multiply them to get the matrix size. They should check for overflow before trying to create the matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
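For reference, a hedged sketch of the kind of guard being asked for: check that numRows * numCols fits in an Int before allocating the backing array. The helper below is illustrative and not the code that went into MLlib.

{code}
// Illustrative overflow guard for a rows-by-cols allocation.
def checkedSize(numRows: Int, numCols: Int): Int = {
  require(numRows >= 0 && numCols >= 0, "numRows and numCols must be non-negative")
  val size: Long = numRows.toLong * numCols.toLong
  require(size <= Int.MaxValue,
    s"$numRows x $numCols = $size elements overflows the maximum array size (${Int.MaxValue})")
  size.toInt
}

// e.g. new Array[Double](checkedSize(numRows, numCols)) instead of new Array[Double](numRows * numCols)
{code}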
[jira] [Created] (SPARK-4406) SVD should check for k < 1
Joseph K. Bradley created SPARK-4406: Summary: SVD should check for k < 1 Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
Cheng Lian created SPARK-4407: - Summary: Thrift server for 0.13.1 doesn't deserialize complex types properly Key: SPARK-4407 URL: https://issues.apache.org/jira/browse/SPARK-4407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.2 Reporter: Cheng Lian Priority: Blocker The following snippet can reproduce this issue: {code} CREATE TABLE t0(m MAP<INT, STRING>); INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; SELECT * FROM t0; {code} Exception thrown: {code} java.lang.RuntimeException: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) ... 19 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4408) Behavior difference between spark-submit conf vs cmd line args
Pedro Rodriguez created SPARK-4408: -- Summary: Behavior difference between spark-submit conf vs cmd line args Key: SPARK-4408 URL: https://issues.apache.org/jira/browse/SPARK-4408 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez Priority: Minor There seems to be a difference between the behavior of bin/spark-submit with using command line arguments vs configuration file. It looks like either a bug or at least where documentation could be clearer about this difference. Steps to Replicate: 1. Submit a job with a command similar to: bin/spark-submit --class nipslda.NipsLda --master local --conf spark.executor.memory=2g --conf spark.driver.memory=2g --verbose ~/Code/nips-lda/target/scala-2.10/nips-lda-assembly-0.1.jar 2. Navigate to SparkUI. 3. Environment tab lists driver and executor memory correctly. 4. Executor tab shows memory as ~260MB (my case) or default JVM limit 5. Write memory arguments to conf/spark-defaults.conf 6. Run same command without memory arguments 7. SparkUI executor tab correctly shows memory setting Looking at spark-submit, it includes this passage in the comments: # For client mode, the driver will be launched in the same JVM that launches # SparkSubmit, so we may need to read the properties file for any extra class # paths, library paths, java options and memory early on. Otherwise, it will # be too late by the time the driver JVM has started. Based on this, it seems that JVM parameters for spark-submit are used for the job itself when in client mode. Effectively, it makes it impossible to use the command line argument setting method to change JVM parameters since the JVM is already launched. This seems like unexpected/undesirable behavior which could be fixed or docs could be changed to better reflect how this works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
[ https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212544#comment-14212544 ] Apache Spark commented on SPARK-4407: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3178 Thrift server for 0.13.1 doesn't deserialize complex types properly --- Key: SPARK-4407 URL: https://issues.apache.org/jira/browse/SPARK-4407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.2 Reporter: Cheng Lian Priority: Blocker The following snippet can reproduce this issue: {code} CREATE TABLE t0(m MAPINT, STRING); INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; SELECT * FROM t0; {code} Exception throw: {code} java.lang.RuntimeException: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) ... 
19 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-603) add simple Counter API
[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-603. --- Resolution: Won't Fix Closing this one as part of [~aash]'s cleanup. I think this problem is being fixed as we add accumulator / metrics values to the web ui. add simple Counter API -- Key: SPARK-603 URL: https://issues.apache.org/jira/browse/SPARK-603 Project: Spark Issue Type: New Feature Priority: Minor Users need a very simple way to create counters in their jobs. Accumulators provide a way to do this, but are a little clunky, for two reasons: 1) the setup is a nuisance 2) w/ delayed evaluation, you don't know when it will actually run, so it's hard to look at the values. Consider this code: {code} def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = { val filterCount = sc.accumulator(0) val filtered = rdd.filter { r => if (isOK(r)) true else { filterCount += 1; false } } println("removed " + filterCount.value + " records") filtered } {code} The println will always say 0 records were filtered, because it's printed before anything has actually run. I could print out the value later on, but note that it would destroy the modularity of the method -- kinda ugly to return the accumulator just so that it can get printed later on. (and of course, the caller in turn might not know when the filter is going to get applied, and would have to pass the accumulator up even further ...) I'd like to have Counters which just automatically get printed out whenever a stage has been run, and also with some api to get them back. I realize this is tricky b/c a stage can get re-computed, so maybe you should only increment the counters once. Maybe a more general way to do this is to provide some callback for whenever an RDD is computed -- by default, you would just print the counters, but the user could replace w/ a custom handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
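Reynold's point about accumulator values in the web UI refers to named accumulators. As a hedged illustration of how the original example can be handled with them: give the accumulator a name (so it shows up per stage in the UI) and only read its value after an action has forced evaluation. The isOK predicate from the original is replaced with a trivial stand-in, and the exact UI rendering may differ; the snippet just shows the pattern.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pattern: name the accumulator (visible in the web UI per stage) and read it
// only after an action has materialized the RDD.
def filterBogus(rdd: RDD[String], sc: SparkContext): RDD[String] = {
  val filterCount = sc.accumulator(0, "bogusRecordsFiltered")
  val filtered = rdd.filter { r =>
    val ok = r.nonEmpty            // stand-in for the original isOK(r)
    if (!ok) filterCount += 1
    ok
  }
  filtered.cache()
  filtered.count()                 // force evaluation so the counter is populated
  println("removed " + filterCount.value + " records")
  filtered
}
{code}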
[jira] [Created] (SPARK-4410) Support for external sort
Michael Armbrust created SPARK-4410: --- Summary: Support for external sort Key: SPARK-4410 URL: https://issues.apache.org/jira/browse/SPARK-4410 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical When any given key has too high a cardinality, the current sorting code can tip over (since it loads a whole partition into memory). I propose we add optional support for using Spark's built-in external sorting mechanism. It can be off by default, but if we determine this code path does not regress performance we can turn it on by default in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4409) Additional (but limited) Linear Algebra Utils
Burak Yavuz created SPARK-4409: -- Summary: Additional (but limited) Linear Algebra Utils Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4395: Description: When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat {code} import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error(Invalid logline: %s % logline) ... return Row( ... UserID= int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating= int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile(file:///home/ec2-user/movielens/ratings.dat) ...# Call the parse_apace_log_line function on each line. ....map(parse_ratings_line) ...# Caches the objects in memory since they will be queried multiple times. ....cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable(RatingsTable) sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() {code} (Now the Python shell hangs...) was: When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat +++ Code ran +++ import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error(Invalid logline: %s % logline) ... return Row( ... UserID= int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating= int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile(file:///home/ec2-user/movielens/ratings.dat) ...# Call the parse_apace_log_line function on each line. ....map(parse_ratings_line) ...# Caches the objects in memory since they will be queried multiple times. 
....cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable(RatingsTable) sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() (Now the Python shell hangs...) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat {code}
[jira] [Resolved] (SPARK-4394) Allow datasources to support IN and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4394. Resolution: Fixed Fix Version/s: 1.2.0 Allow datasources to support IN and sizeInBytes --- Key: SPARK-4394 URL: https://issues.apache.org/jira/browse/SPARK-4394 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 These are both pretty critical for running benchmarks against externally defined datasources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-4409: --- Description: This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. Additional (but limited) Linear Algebra Utils - Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. 
gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a
[jira] [Created] (SPARK-4411) Add kill link for jobs in the UI
Kay Ousterhout created SPARK-4411: - Summary: Add kill link for jobs in the UI Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-4411: -- Issue Type: New Feature (was: Bug) Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4413) Parquet support through datasource API
Michael Armbrust created SPARK-4413: --- Summary: Parquet support through datasource API Key: SPARK-4413 URL: https://issues.apache.org/jira/browse/SPARK-4413 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Right now there are several issues with our parquet support. Specifically, the only way to access parquet files through pure SQL is by including Hive, which has the following issues: - fairly verbose syntax - requires you to explicitly add partitions - does not support decimal types - querying tables with many partitions results in metadata operations dominating the query time (even worse when reading from S3). It would be great to have better native support here through the new datasources API. Ideally once that is in place we can deprecate the existing ParquetRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4398) Specialize rdd.parallelize for xrange
[ https://issues.apache.org/jira/browse/SPARK-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4398. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3264 [https://github.com/apache/spark/pull/3264] Specialize rdd.parallelize for xrange - Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 sc.parallelize(range) is slow, which writes to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4413) Parquet support through datasource API
[ https://issues.apache.org/jira/browse/SPARK-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212805#comment-14212805 ] Apache Spark commented on SPARK-4413: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3269 Parquet support through datasource API -- Key: SPARK-4413 URL: https://issues.apache.org/jira/browse/SPARK-4413 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Right now there are several issues with our parquet support. Specifically, the only way to access parquet files through pure SQL is by including Hive, which has the following issues: - fairly verbose syntax - requires you to explicitly add partitions - does not support decimal types - querying tables with many partitions results in metadata operations dominating the query time (even worse when reading from S3). It would be great to have better native support here through the new datasources API. Ideally once that is in place we can deprecate the existing ParquetRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212815#comment-14212815 ] Xiangrui Meng edited comment on SPARK-3080 at 11/14/14 8:56 PM: Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue, which violates the immutability assumption of RDD. was (Author: mengxr): Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
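The "immutability assumption" mentioned here is that the ratings RDD must describe the same records (and therefore the same user and product IDs) every time a partition is recomputed. Below is a hedged sketch of the kind of anti-pattern that can break ALS this way: building the input from a non-deterministic transformation, so that recomputation after a failure yields IDs the previously built block indices have never seen. This is only an illustration of the failure mode being discussed, not a reproduction of the reporter's code.

{code}
import scala.util.Random
import org.apache.spark.SparkContext

// Anti-pattern sketch: an RDD whose contents differ every time it is evaluated.
// If such an RDD feeds ALS and a partition is recomputed mid-job, the new user /
// product IDs may not exist in the block indices built from the first evaluation,
// which can surface as an ArrayIndexOutOfBoundsException deep inside ALS.
def unstableRatings(sc: SparkContext) =
  sc.parallelize(1 to 1000000).flatMap { i =>
    // No fixed seed: each evaluation keeps a *different* subset of records.
    if (Random.nextDouble() < 0.5) Some((Random.nextInt(10000), Random.nextInt(10000), 1.0))
    else None
  }

// Safer alternatives: sample with a fixed seed, or persist / checkpoint the
// sampled RDD before handing it to ALS.
{code}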
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212815#comment-14212815 ] Xiangrui Meng commented on SPARK-3080: -- Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Assignee: Xiangrui Meng ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz Assignee: Xiangrui Meng The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212822#comment-14212822 ] Ilya Ganelin commented on SPARK-3080: - Hi Xiangrui - I was not doing any sort of randomization or sampling in the code that produced this issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz Assignee: Xiangrui Meng The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
Pedro Rodriguez created SPARK-4414: -- Summary: SparkContext.wholeTextFiles Doesn't work with S3 Buckets Key: SPARK-4414 URL: https://issues.apache.org/jira/browse/SPARK-4414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case, which follows the same pattern, is in the git repo linked below. Steps to reproduce: 1. Create an Amazon S3 bucket with multiple files and make it public 2. Attempt to read the bucket with sc.wholeTextFiles(s3n://mybucket/myfile.txt) 3. Spark returns the following error, even though the file exists. Exception in thread main java.io.FileNotFoundException: File does not exist: /myfile.txt at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.init(CombineFileInputFormat.java:489) 4. Change the call to sc.textFile(s3n://mybucket/myfile.txt) and there is no error message; the application runs fine. There is a question on StackOverflow about this as well: http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist This is a link to the repo/lines of code. The uncommented call doesn't work, the commented call works as expected: https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 Falling back to textFile with a multi-file argument would be easy, but wholeTextFiles should work correctly for S3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
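A minimal Scala repro sketch of the behaviour described above (the bucket and key names are placeholders, and the comment on the cause is an inference from the stack trace, not a confirmed diagnosis):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wholeTextFiles-s3-repro"))

// Works: textFile resolves the s3n:// path and counts lines.
val lines = sc.textFile("s3n://mybucket/myfile.txt").count()

// Fails with java.io.FileNotFoundException: File does not exist: /myfile.txt --
// apparently because CombineFileInputFormat resolves the path against the
// default (HDFS) filesystem instead of the S3 filesystem.
val files = sc.wholeTextFiles("s3n://mybucket/myfile.txt").count()
{code}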
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212833#comment-14212833 ] sam commented on SPARK-1867: [~ansonism] Are you 100% sure your jar is also the same hadoop dep version?? Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() 
.setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
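For readability, here is the reporter's driver snippet again with line breaks and the string quoting that was lost in formatting restored; clusterMaster, appName, sparkHome and someHdfsPath are values defined elsewhere in the reporter's code:
{code}
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))

println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
{code}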
[jira] [Updated] (SPARK-3860) Improve dimension joins
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3860: Priority: Critical (was: Major) Improve dimension joins --- Key: SPARK-3860 URL: https://issues.apache.org/jira/browse/SPARK-3860 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical This is an umbrella ticket for improving performance for joining multiple dimension tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212852#comment-14212852 ] Anson Abraham commented on SPARK-1867: -- Yes. i added 3 data nodes just for this. And I had 2 of them as my worker node and the other as my master. Still getting that issue. Also the jar files were all supplied by Cloudera. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4380) Executor full of log spilling in-memory map of 0 MB to disk
[ https://issues.apache.org/jira/browse/SPARK-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4380: - Assignee: Hong Shen Executor full of log spilling in-memory map of 0 MB to disk - Key: SPARK-4380 URL: https://issues.apache.org/jira/browse/SPARK-4380 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen Assignee: Hong Shen When I set spark.shuffle.manager = sort, the executor log fills up with messages like the following, which is quite confusing: 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (277 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (278 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (279 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (280 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (281 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (282 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (283 spills so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4380) Executor full of log spilling in-memory map of 0 MB to disk
[ https://issues.apache.org/jira/browse/SPARK-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212857#comment-14212857 ] Andrew Or commented on SPARK-4380: -- In general it's pretty worrying that it's spilling so much. Nevertheless this is a good change. We should look into why it's spilling all the time separately. Executor full of log spilling in-memory map of 0 MB to disk - Key: SPARK-4380 URL: https://issues.apache.org/jira/browse/SPARK-4380 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen Assignee: Hong Shen When I set spark.shuffle.manager = sort, the executor log fills up with messages like the following, which is quite confusing: 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (277 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (278 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (279 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (280 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (281 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (282 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (283 spills so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4313) Thread Dump link is broken in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4313. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Thread Dump link is broken in yarn-cluster mode - Key: SPARK-4313 URL: https://issues.apache.org/jira/browse/SPARK-4313 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.2.0 In yarn-cluster mode, the Web UI is running behind a yarn proxy server. Some features(or bugs?) of yarn proxy server will break the links for thread dump. 1. Yarn proxy server will do http redirect internally, so if opening http://example.com:8088/cluster/app/application_1415344371838_0012/executors;, it will fetch http://example.com:8088/cluster/app/application_1415344371838_0012/executors/; and return the content but won't change the link in the browser. Then when a user clicks Thread Dump, it will jump to http://example.com:8088/proxy/application_1415344371838_0012/threadDump/?executorId=2;. This is a wrong link. The correct link should be http://example.com:8088/proxy/application_1415344371838_0012/executors/threadDump/?executorId=2;. 2. Yarn proxy server has a bug about the URL encode/decode. When a user accesses http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E;, the yarn proxy server will require http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E;. But Spark web server expects http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E;. I will report this issue to Hadoop community later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4313) Thread Dump link is broken in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4313: - Assignee: Shixiong Zhu Thread Dump link is broken in yarn-cluster mode - Key: SPARK-4313 URL: https://issues.apache.org/jira/browse/SPARK-4313 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.2.0 In yarn-cluster mode, the Web UI is running behind a yarn proxy server. Some features(or bugs?) of yarn proxy server will break the links for thread dump. 1. Yarn proxy server will do http redirect internally, so if opening http://example.com:8088/cluster/app/application_1415344371838_0012/executors;, it will fetch http://example.com:8088/cluster/app/application_1415344371838_0012/executors/; and return the content but won't change the link in the browser. Then when a user clicks Thread Dump, it will jump to http://example.com:8088/proxy/application_1415344371838_0012/threadDump/?executorId=2;. This is a wrong link. The correct link should be http://example.com:8088/proxy/application_1415344371838_0012/executors/threadDump/?executorId=2;. 2. Yarn proxy server has a bug about the URL encode/decode. When a user accesses http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E;, the yarn proxy server will require http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E;. But Spark web server expects http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E;. I will report this issue to Hadoop community later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4404: - Affects Version/s: 1.1.0 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver. If we take the process id of SparkSubmitDriverBootstrapper and kill it while it is running, we expect its SparkSubmit sub-process to stop as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
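This kind of behaviour is usually implemented with a JVM shutdown hook in the bootstrapper that destroys the child process. The snippet below is a hedged sketch of that pattern (command is assumed to be the SparkSubmit command line built elsewhere); it is not the patch that was actually merged:
{code}
// In SparkSubmitDriverBootstrapper, after building `command: Seq[String]`:
val process = new ProcessBuilder(command: _*).start()

// If the bootstrapper JVM is killed (e.g. via SIGTERM), take the child down too.
Runtime.getRuntime.addShutdownHook(new Thread("destroy-sparksubmit-subprocess") {
  override def run(): Unit = process.destroy()
})

// Propagate the child's exit code once it finishes normally.
System.exit(process.waitFor())
{code}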
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212875#comment-14212875 ] Ilya Ganelin commented on SPARK-3694: - There is also task serialization that happens within the DAG Scheduler. Do we want to print that? Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Ilya Ganelin Labels: starter This would be useful for debugging extra references inside of RDD's Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
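For illustration only, a reflection-based object-graph walk along the lines of the linked ObjectGraphWalker might look like the sketch below. This is not Spark code; a real implementation would need identity-based visited tracking, array-element handling, and field filtering:
{code}
import java.lang.reflect.{Field, Modifier}
import scala.collection.mutable

// Print every reachable reference field of `root`, indented by depth.
// Sketch only: uses equals/hashCode for cycle detection and ignores array elements.
def printObjectGraph(root: AnyRef, maxDepth: Int = 5): Unit = {
  val visited = mutable.Set.empty[AnyRef]

  def allFields(cls: Class[_]): Seq[Field] =
    if (cls == null) Seq.empty
    else cls.getDeclaredFields.toSeq.filterNot(f => Modifier.isStatic(f.getModifiers)) ++
      allFields(cls.getSuperclass)

  def walk(obj: AnyRef, depth: Int): Unit = {
    if (obj == null || depth > maxDepth || !visited.add(obj)) return
    println(("  " * depth) + obj.getClass.getName)
    for (f <- allFields(obj.getClass) if !f.getType.isPrimitive) {
      f.setAccessible(true)
      walk(f.get(obj), depth + 1)
    }
  }

  walk(root, 0)
}
{code}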
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212884#comment-14212884 ] Apache Spark commented on SPARK-2321: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3197 Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical Fix For: 1.2.0 This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4239) support view in HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4239. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3131 [https://github.com/apache/spark/pull/3131] support view in HiveQL -- Key: SPARK-4239 URL: https://issues.apache.org/jira/browse/SPARK-4239 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4245) Fix containsNull of the result ArrayType of CreateArray expression.
[ https://issues.apache.org/jira/browse/SPARK-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4245. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3110 [https://github.com/apache/spark/pull/3110] Fix containsNull of the result ArrayType of CreateArray expression. --- Key: SPARK-4245 URL: https://issues.apache.org/jira/browse/SPARK-4245 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Fix For: 1.2.0 The {{containsNull}} of the result {{ArrayType}} of {{CreateArray}} should be true only if the children list is empty or at least one child is nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4333) Correctly log number of iterations in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4333. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3180 [https://github.com/apache/spark/pull/3180] Correctly log number of iterations in RuleExecutor -- Key: SPARK-4333 URL: https://issues.apache.org/jira/browse/SPARK-4333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor Fix For: 1.2.0 RuleExecutor logs the wrong iteration count: it should report $(iteration - 1), not $(iteration). The log reads Fixed point reached for batch $(batch.name) after 3 iterations., but only 2 iterations actually ran! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
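The fix boils down to a one-line change in the fixed-point log message, roughly as follows (the logging call and variable names are illustrative, not the exact patch):
{code}
// Before: by the time the fixed point is detected, `iteration` has already been
// incremented past the last pass that actually executed.
logTrace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")

// After: report the number of iterations that actually ran.
logTrace(s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
{code}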
[jira] [Resolved] (SPARK-4062) Improve KafkaReceiver to prevent data loss
[ https://issues.apache.org/jira/browse/SPARK-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4062. -- Resolution: Fixed Improve KafkaReceiver to prevent data loss -- Key: SPARK-4062 URL: https://issues.apache.org/jira/browse/SPARK-4062 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Saisai Shao Assignee: Saisai Shao Attachments: RefactoredKafkaReceiver.pdf The current KafkaReceiver has data loss and data re-consuming problems. Here we propose a ReliableKafkaReceiver to improve its reliability and fault tolerance using Spark Streaming's WAL mechanism. This is follow-up work to SPARK-3129. The design doc is posted; any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3129) Prevent data loss in Spark Streaming on driver failure
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3129. -- Resolution: Fixed Fix Version/s: 1.2.0 I am marking this as fixed, as all non-test related issues have been merged. The one sub-task left is related to unit tests that use the WAL to do end-to-end tests and verify no data loss. Prevent data loss in Spark Streaming on driver failure -- Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.0.3 Reporter: Hari Shreedharan Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 Attachments: SecurityFix.diff Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). This currently affects all receivers. The solution we propose is to reliably store all the received data into HDFS. This will allow the data to persist through driver failures, and therefore it can be processed when the driver gets restarted. The high level design doc for this feature is given here. https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?usp=sharing This major task has been divided into sub-tasks - Implementing a write ahead log management system that can manage rolling write ahead logs - write to log, recover on failure and clean up old logs - Implementing an HDFS-backed block RDD that can read data either from Spark's BlockManager or from HDFS files - Implementing a ReceivedBlockHandler interface that abstracts out the functionality of saving received blocks - Implementing a ReceivedBlockTracker and other associated changes in the driver that allow metadata of received blocks and block-to-batch allocations to be recovered upon driver restart -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4246) Add testsuite with end-to-end testing of driver failure
[ https://issues.apache.org/jira/browse/SPARK-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4246: - Target Version/s: 1.2.0 Add testsuite with end-to-end testing of driver failure Key: SPARK-4246 URL: https://issues.apache.org/jira/browse/SPARK-4246 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3129: - Summary: Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs (was: Prevent data loss in Spark Streaming on driver failure) Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs - Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.0.3 Reporter: Hari Shreedharan Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 Attachments: SecurityFix.diff Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). This currently affects all receivers. The solution we propose is to reliably store all the received data into HDFS. This will allow the data to persist through driver failures, and therefore it can be processed when the driver gets restarted. The high level design doc for this feature is given here. https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?usp=sharing This major task has been divided into sub-tasks - Implementing a write ahead log management system that can manage rolling write ahead logs - write to log, recover on failure and clean up old logs - Implementing an HDFS-backed block RDD that can read data either from Spark's BlockManager or from HDFS files - Implementing a ReceivedBlockHandler interface that abstracts out the functionality of saving received blocks - Implementing a ReceivedBlockTracker and other associated changes in the driver that allow metadata of received blocks and block-to-batch allocations to be recovered upon driver restart -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4391) Parquet Filter pushdown flag should be set with SQLConf
[ https://issues.apache.org/jira/browse/SPARK-4391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4391. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3258 [https://github.com/apache/spark/pull/3258] Parquet Filter pushdown flag should be set with SQLConf --- Key: SPARK-4391 URL: https://issues.apache.org/jira/browse/SPARK-4391 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4322) Struct fields can't be used as sub-expression of grouping fields
[ https://issues.apache.org/jira/browse/SPARK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4322. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3248 [https://github.com/apache/spark/pull/3248] Struct fields can't be used as sub-expression of grouping fields Key: SPARK-4322 URL: https://issues.apache.org/jira/browse/SPARK-4322 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 Some examples: {code} sqlContext.jsonRDD(sc.parallelize({a: {b: [{c: 1}]}} :: Nil)).registerTempTable(data1) sqlContext.sql(SELECT a.b[0].c FROM data1 GROUP BY a.b[0].c).collect() sqlContext.jsonRDD(sc.parallelize({a: {b: 1}} :: Nil)).registerTempTable(data2) sqlContext.sql(SELECT a.b + 1 FROM data2 GROUP BY a.b + 1).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
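The string quoting in the {code} examples above was stripped in formatting; the snippets presumably read roughly as follows (a reconstruction, not copied from the original report):
{code}
sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": [{"c": 1}]}}""" :: Nil)).registerTempTable("data1")
sqlContext.sql("SELECT a.b[0].c FROM data1 GROUP BY a.b[0].c").collect()

sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": 1}}""" :: Nil)).registerTempTable("data2")
sqlContext.sql("SELECT a.b + 1 FROM data2 GROUP BY a.b + 1").collect()
{code}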
[jira] [Resolved] (SPARK-4386) Parquet file write performance improvement
[ https://issues.apache.org/jira/browse/SPARK-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4386. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3254 [https://github.com/apache/spark/pull/3254] Parquet file write performance improvement -- Key: SPARK-4386 URL: https://issues.apache.org/jira/browse/SPARK-4386 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Jim Carroll Fix For: 1.2.0 If you profile the writing of a Parquet file, the single worst time consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized (optimized?). This doesn't need to be done. size is called repeatedly where needed rather than called once at the top of the method and stored in a 'val'. I have a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
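The improvement described amounts to hoisting the repeated size call out of the write loop; a sketch of the pattern follows (writeValue, attributes and record are illustrative names, not the actual MutableRowWriteSupport code):
{code}
// Before (simplified): the loop bound re-evaluates attributes.size on every
// iteration, and for a LinearSeqOptimized each call counts the elements again.
//   while (index < attributes.size) { writeValue(...); index += 1 }

// After: evaluate the size once and reuse it.
val attributesSize = attributes.size
var index = 0
while (index < attributesSize) {
  writeValue(attributes(index).dataType, record(index))
  index += 1
}
{code}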
[jira] [Resolved] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
[ https://issues.apache.org/jira/browse/SPARK-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4365. - Resolution: Fixed Issue resolved by pull request 3229 [https://github.com/apache/spark/pull/3229] Remove unnecessary filter call on records returned from parquet library --- Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those : from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java public boolean nextKeyValue() throws IOException, InterruptedException { boolean recordFound = false; while (!recordFound) { // no more records left if (current = total) { return false; } try { checkRead(); currentValue = recordReader.read(); current ++; if (recordReader.shouldSkipCurrentRecord()) { // this record is being filtered via the filter2 package if (DEBUG) LOG.debug(skipping record); continue; } if (currentValue == null) { // only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar; if (DEBUG) LOG.debug(filtered record reader reached end of block); continue; } recordFound = true; if (DEBUG) LOG.debug(read value: + currentValue); } catch (RuntimeException e) { throw new ParquetDecodingException(format(Can not read value at %d in block %d in file %s, current, currentBlock, file), e); } } return true; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4412. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3271 [https://github.com/apache/spark/pull/3271] Parquet logger cannot be configured --- Key: SPARK-4412 URL: https://issues.apache.org/jira/browse/SPARK-4412 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jim Carroll Fix For: 1.2.0 The Spark ParquetRelation.scala code assumes that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if reading a Parquet file is the first thing your application does with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log, there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer. The fix is to force the static initializer in parquet.Log to run as part of enableLogForwarding. PR will be forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
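A sketch of the kind of fix described (forcing the class to initialize before the logger is reconfigured); this is an illustration, not the merged patch:
{code}
def enableLogForwarding() {
  // Class.forName initializes parquet.Log now, so its static initializer cannot
  // later undo the logger configuration applied below.
  Class.forName("parquet.Log")

  // ...existing log-forwarding setup (adjust the "parquet" JUL logger's handlers/levels)...
}
{code}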
[jira] [Commented] (SPARK-4415) Driver did not exit after python driver had exited.
[ https://issues.apache.org/jira/browse/SPARK-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213076#comment-14213076 ] Apache Spark commented on SPARK-4415: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3274 Driver did not exit after python driver had exited. --- Key: SPARK-4415 URL: https://issues.apache.org/jira/browse/SPARK-4415 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Critical If we have spark_driver_memory in spark-default.cfg and start the Spark job by {code} $ python xxx.py {code} then the Spark driver will not exit after the Python process has exited. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need
[ https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4214. Resolution: Fixed Fix Version/s: 1.2.0 With dynamic allocation, avoid outstanding requests for more executors than pending tasks need -- Key: SPARK-4214 URL: https://issues.apache.org/jira/browse/SPARK-4214 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 Dynamic allocation tries to allocate more executors while we have pending tasks remaining. Our current policy can end up with more outstanding executor requests than needed to fulfill all the pending tasks. Capping the executor requests to the number of cores needed to fulfill all pending tasks would make dynamic allocation behavior less sensitive to settings for maxExecutors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
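A rough sketch of the capping policy described above (the names and the exact formula are illustrative; the real heuristic in ExecutorAllocationManager may differ):
{code}
// Upper bound on executors that the currently pending + running tasks could use.
def maxExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int =
  ((pendingTasks + runningTasks) + tasksPerExecutor - 1) / tasksPerExecutor  // ceiling

// When ramping up, never let (existing executors + outstanding requests) exceed
// min(maxExecutorsNeeded(...), spark.dynamicAllocation.maxExecutors).
{code}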
[jira] [Issue Comment Deleted] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Liu updated SPARK-4345: Comment: was deleted (was: Swallow NoSuchObjectException exception when drop a none-exist hive table. pull @https://github.com/apache/spark/pull/3211) Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at 
org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at
[jira] [Commented] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213121#comment-14213121 ] Alex Liu commented on SPARK-4345: - It looks like a bug in Hive https://issues.apache.org/jira/browse/HIVE-8564 Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at 
org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at
[jira] [Resolved] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Liu resolved SPARK-4345. - Resolution: Won't Fix Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at 
org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:78) at
[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213128#comment-14213128 ] Apache Spark commented on SPARK-4349: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/3275 Spark driver hangs on sc.parallelize() if exception is thrown during serialization -- Key: SPARK-4349 URL: https://issues.apache.org/jira/browse/SPARK-4349 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Matt Cheah Fix For: 1.3.0 Executing the following in the Spark Shell will lead to the Spark Shell hanging after a stack trace is printed. The serializer is set to the Kryo serializer. {code} scala import com.esotericsoftware.kryo.io.Input import com.esotericsoftware.kryo.io.Input scala import com.esotericsoftware.kryo.io.Output import com.esotericsoftware.kryo.io.Output scala class MyKryoSerializable extends com.esotericsoftware.kryo.KryoSerializable { def write (kryo: com.esotericsoftware.kryo.Kryo, output: Output) { throw new com.esotericsoftware.kryo.KryoException; } ; def read (kryo: com.esotericsoftware.kryo.Kryo, input: Input) { throw new com.esotericsoftware.kryo.KryoException; } } defined class MyKryoSerializable scala sc.parallelize(Seq(new MyKryoSerializable, new MyKryoSerializable)).collect {code} A stack trace is printed during serialization as expected, but another stack trace is printed afterwards, indicating that the driver can't recover: {code} 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not unique! akka.actor.PostRestartException: exception post restart (class java.io.IOException) at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249) at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247) at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302) at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247) at akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76) at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] is not unique! 
at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) at akka.actor.ActorCell.attachChild(ActorCell.scala:369) at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552) at org.apache.spark.executor.Executor.init(Executor.scala:97) at org.apache.spark.scheduler.local.LocalActor.init(LocalBackend.scala:53) at org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) at org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343) at akka.actor.Props.newActor(Props.scala:252) at akka.actor.ActorCell.newActor(ActorCell.scala:552) at akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234) ... 11 more {code} -- This message was sent by Atlassian JIRA
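As a side note, one way to avoid the hang described above is to surface serialization failures on the driver before sc.parallelize() ever sees the data. The sketch below is not from this ticket or from PR 3275; it is a hedged workaround that eagerly exercises Kryo's write() with a standalone Kryo instance (not Spark's configured serializer), so a MyKryoSerializable that throws fails fast with an ordinary exception instead of wedging the shell:

{code}
// Hedged workaround sketch (not part of the actual fix): serialize each element
// once on the driver with a plain Kryo instance so that a failing write() raises
// a normal exception before sc.parallelize() is involved.
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

def assertKryoSerializable(items: Seq[AnyRef]): Unit = {
  val kryo = new Kryo()
  items.foreach { item =>
    val out = new Output(new ByteArrayOutputStream())
    try kryo.writeClassAndObject(out, item)   // throws KryoException if write() fails
    finally out.close()
  }
}

val data = Seq(new MyKryoSerializable, new MyKryoSerializable)
assertKryoSerializable(data)      // fails here, on the driver, with a clear stack trace
sc.parallelize(data).collect()    // only reached if serialization succeeded
{code}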
[jira] [Created] (SPARK-4417) New API: sample RDD to fixed number of items
Davies Liu created SPARK-4417: - Summary: New API: sample RDD to fixed number of items Key: SPARK-4417 URL: https://issues.apache.org/jira/browse/SPARK-4417 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Reporter: Davies Liu Sometimes we just want a fixed number of items randomly selected from an RDD; for example, before sorting an RDD we need to gather a fixed number of keys from each partition. To do this today we need two passes over the RDD: first get the total count, then calculate the right ratio for sampling. In fact, we could do this in one pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
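For context, the two-pass pattern the ticket wants to collapse into a single pass looks roughly like the sketch below. The helper name sampleToSize, the default seed, and the 20% oversampling factor are made up for illustration; only count(), sample() and take() are existing RDD methods:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Today's approach: one full pass to count, a second pass to sample.
// The proposed API would reach a fixed sample size in a single pass.
def sampleToSize[T: ClassTag](rdd: RDD[T], numItems: Int, seed: Long = 42L): Array[T] = {
  val total = rdd.count()                                  // pass 1: scan just to size the sample
  if (total == 0L) return Array.empty[T]
  // Oversample slightly so the probabilistic sample rarely comes up short.
  val fraction = math.min(1.0, 1.2 * numItems.toDouble / total)
  rdd.sample(withReplacement = false, fraction, seed)      // pass 2: the actual sampling
    .take(numItems)
}

// Example: sampleToSize(sc.parallelize(1 to 1000000), 100)
{code}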
[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support
[ https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213300#comment-14213300 ] Guoqiang Li commented on SPARK-3445: I think there are a lot of people still using Hadoop 1.x. We should support it at least until 2015. Deprecate and later remove YARN alpha support - Key: SPARK-3445 URL: https://issues.apache.org/jira/browse/SPARK-3445 Project: Spark Issue Type: Improvement Components: YARN Reporter: Patrick Wendell This will depend a bit on both user demand and the commitment level of maintainers, but I'd like to propose the following timeline for yarn-alpha support. Spark 1.2: Deprecate YARN-alpha Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable) Since YARN-alpha is clearly identified as an alpha API, it seems reasonable to drop support for it in a minor release. However, it does depend a bit on whether anyone uses this outside of Yahoo!, and I'm not sure of that. In the past this API has been used and maintained by Yahoo, but they'll be migrating soon to the stable APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4404. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: WangTaoTheTonic Target Version/s: 1.2.0 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.2.0 When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver. If we get the process id of SparkSubmitDriverBootstrapper and kill it while it is running, we expect its SparkSubmit sub-process to stop as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4415) Driver did not exit after python driver had exited.
[ https://issues.apache.org/jira/browse/SPARK-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4415. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Davies Liu Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Driver did not exit after python driver had exited. --- Key: SPARK-4415 URL: https://issues.apache.org/jira/browse/SPARK-4415 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Fix For: 1.2.0 If we have spark.driver.memory in spark-defaults.conf and start the Spark job with {code} $ python xxx.py {code} then the Spark driver does not exit after the Python process has exited. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4420) Change nullability of Cast from DoubleType/FloatType to DecimalType.
Takuya Ueshin created SPARK-4420: Summary: Change nullability of Cast from DoubleType/FloatType to DecimalType. Key: SPARK-4420 URL: https://issues.apache.org/jira/browse/SPARK-4420 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up to SPARK-4390. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
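For readers wondering why the nullability has to change: with the fixed-precision decimals introduced by SPARK-4390, a Double or Float value that does not fit the target precision is cast to null, so the Cast expression must report nullable = true. The helper below is a hypothetical, plain-Scala illustration of that behavior, not Spark's actual Cast implementation:

{code}
// Hypothetical illustration of the overflow-to-null behavior behind this change.
// A DECIMAL(10, 2) holds at most 8 integral digits, so 1e30 cannot be represented.
def castToDecimal(d: Double, precision: Int, scale: Int): Option[BigDecimal] = {
  val bd = BigDecimal(d).setScale(scale, BigDecimal.RoundingMode.HALF_UP)
  if (bd.precision > precision) None else Some(bd)   // overflow becomes null in Spark SQL
}

castToDecimal(12.5, 10, 2)   // Some(12.50)
castToDecimal(1e30, 10, 2)   // None -- the cast result is null, hence nullable = true
{code}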