[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212000#comment-14212000 ] Rui Li commented on SPARK-2321: --- Hi [~joshrosen], The new API is quite useful. But the information exposed is relatively limited at the moment. Do you have any plan to enhance it? For example, submission and completion time is not available in {{SparkStageInfo}}, while they're provided in {{StageInfo}}. Thanks! Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical Fix For: 1.2.0 This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
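To make the gap concrete: until SparkStageInfo grows those fields, stage submission and completion times are only reachable through the DeveloperApi listener that this ticket aims to stabilize. Below is a minimal sketch against the 1.2-era Scala API; the listener and variable names are illustrative, not part of any proposal.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Sketch: read the timing that StageInfo exposes but SparkStageInfo does not yet.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    // submissionTime and completionTime are Option[Long] millisecond timestamps.
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
}

// Usage, given an existing SparkContext `sc`:
//   sc.addSparkListener(new StageTimingListener)
{code}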
[jira] [Created] (SPARK-4398) Specialize rdd.parallelize for xrange
Xiangrui Meng created SPARK-4398: Summary: Specialize rdd.parallelize for xrange Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng sc.parallelize(range) is slow because it writes the data to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4398) Specialize rdd.parallelize for xrange
[ https://issues.apache.org/jira/browse/SPARK-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212014#comment-14212014 ] Apache Spark commented on SPARK-4398: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3264 Specialize rdd.parallelize for xrange - Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng sc.parallelize(range) is slow because it writes the data to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-689) Task will crash when setting SPARK_WORKER_CORES 128
[ https://issues.apache.org/jira/browse/SPARK-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207721#comment-14207721 ] Andrew Ash edited comment on SPARK-689 at 11/14/14 8:46 AM: I attempted a repro on a one-node cluster (my laptop) and confirmed that this bug no longer exists on master. A code inspection reveals that there is no longer a 128-thread limit on the Executor's threadpool referenced by this stacktrace line: {{at spark.executor.Executor.launchTask(Executor.scala:59)}} Here's the outline of my repro attempt: {noformat} aash@aash-mbp ~/git/spark$ cat conf/spark-env.sh SPARK_WORKER_CORES=200 SPARK_MASTER_IP=aash-mbp.local SPARK_PUBLIC_DNS=aash-mbp.local aash@aash-mbp ~/git/spark$ cat conf/spark-defaults.sh spark.master spark://aash-mbp.local:7077 aash@aash-mbp ~/git/spark$ sbin/start-all.sh ... aash@aash-mbp ~/git/spark$ bin/spark-shell spark sc.parallelize(1L to 10000L, 200).reduce(_+_) res0: Long = 50005000 spark {noformat} I'm now closing this ticket, but please reopen [~xiajunluan] if you're still having issues. was (Author: aash): I attempted a repro on a one-node cluster (my laptop) and confirmed that this bug no longer exists. A code inspection reveals that there is no longer a 128-thread limit on the Executor's threadpool referenced by this stacktrace line: {{at spark.executor.Executor.launchTask(Executor.scala:59)}} Here's the outline of my repro attempt: {noformat} aash@aash-mbp ~/git/spark$ cat conf/spark-env.sh SPARK_WORKER_CORES=200 SPARK_MASTER_IP=aash-mbp.local SPARK_PUBLIC_DNS=aash-mbp.local aash@aash-mbp ~/git/spark$ cat conf/spark-defaults.sh spark.master spark://aash-mbp.local:7077 aash@aash-mbp ~/git/spark$ sbin/start-all.sh ... aash@aash-mbp ~/git/spark$ bin/spark-shell spark sc.parallelize(1L to 10000L, 200).reduce(_+_) res0: Long = 50005000 spark {noformat} I'm now closing this ticket, but please reopen [~xiajunluan] if you're still having issues.
Task will crash when setting SPARK_WORKER_CORES 128 Key: SPARK-689 URL: https://issues.apache.org/jira/browse/SPARK-689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: xiajunluan when I set SPARK_WORKER_CORES 128(for example 200), and run a job in standalone mode that will allocate 200 tasks in one worker node, then task will crash(it seems that worker cores has been hard-code) {noformat} 13/02/07 11:25:02 ERROR StandaloneExecutorBackend: Task spark.executor.Executor$TaskRunner@5367839e rejected from java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, active threads = 128, queued tasks = 0, completed tasks = 0] java.util.concurrent.RejectedExecutionException: Task spark.executor.Executor$TaskRunner@5367839e rejected from java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, active threads = 128, queued tasks = 0, completed tasks = 0] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2013) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:816) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1337) at spark.executor.Executor.launchTask(Executor.scala:59) at spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:57) at spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:46) at akka.actor.Actor$class.apply(Actor.scala:318) at spark.executor.StandaloneExecutorBackend.apply(StandaloneExecutorBackend.scala:17) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 13/02/07 11:25:02 INFO StandaloneExecutorBackend: Connecting to master: akka://spark@10.0.2.19:60882/user/StandaloneScheduler 13/02/07 11:25:02 INFO StandaloneExecutorBackend: Got assigned task 1929 13/02/07 11:25:02 INFO Executor: launch taskId: 1929 13/02/07 11:25:02 ERROR StandaloneExecutorBackend: java.lang.NullPointerException at spark.executor.Executor.launchTask(Executor.scala:59) at
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212019#comment-14212019 ] Davies Liu commented on SPARK-4395: --- [~marmbrus] After removing the cache(), this script finished quickly, so I think there is something wrong with the caching. It also hung when I moved the .cache() before the parsing. While hanging, most of the CPU is spent in the JVM; the python process (only one) is idle. Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat +++ Code ran +++ import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error("Invalid logline: %s" % logline) ... return Row( ... UserID = int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating = int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat") ... # Call the parse_apace_log_line function on each line. ... .map(parse_ratings_line) ... # Caches the objects in memory since they will be queried multiple times. ... .cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable("RatingsTable") sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() (Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-755) Kryo serialization failing - MLbase
[ https://issues.apache.org/jira/browse/SPARK-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212033#comment-14212033 ] Andrew Ash commented on SPARK-755: -- Quick check [~sparks], this hasn't been updated in quite some time; is it still a problem for you? I think you may have been bit by SPARK-2878 which also deals with Kryo serialization failing and has since been fixed. Kryo serialization failing - MLbase --- Key: SPARK-755 URL: https://issues.apache.org/jira/browse/SPARK-755 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 0.8.0 Reporter: Evan Sparks When I turn on Kryo serialization, I get the following error as I increase the size of my input dataset. (From ~10GB to ~100GB). This issue does not manifest itself when I turn kryo off. I have code that successfully reads files, parses them into an {noformat}RDD[(String,Vector)]{noformat}, which can then be .count()'ed. I then run a .flatMap on these, with a function that has the following signature: {code} def expandData(x: (String, Vector)): Seq[(String, Float, Vector)] {code} And running a .count() on that RDD crashes - stack trace of failed task looks like this: {noformat} 13/05/31 00:16:53 INFO cluster.TaskSetManager: Finished TID 2024 in 23594 ms (progress: 10/1000) 13/05/31 00:16:53 INFO scheduler.DAGScheduler: Completed ResultTask(3, 24) 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:151 13/05/31 00:16:53 INFO cluster.TaskSetManager: Starting task 3.0:175 as TID 2161 on slave 14: ip-10-62-199-77.ec2.internal:40850 (NODE_LOCAL) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Serialized task 3.0:175 as 2832 bytes in 0 ms 13/05/31 00:16:53 INFO cluster.TaskSetManager: Lost TID 2053 (task 3.0:49) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Loss was due to com.esotericsoftware.kryo.KryoException com.esotericsoftware.kryo.KryoException: java.lang.ArrayIndexOutOfBoundsException Serialization trace: elements (org.mlbase.Vector) _3 (scala.Tuple3) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:504) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:571) at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:26) at spark.serializer.SerializationStream$class.writeAll(Serializer.scala:63) at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:21) at spark.storage.BlockManager.dataSerialize(BlockManager.scala:910) at spark.storage.MemoryStore.putValues(MemoryStore.scala:61) at spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:584) at spark.storage.BlockManager.put(BlockManager.scala:580) at spark.CacheManager.getOrCompute(CacheManager.scala:55) at spark.RDD.iterator(RDD.scala:207) at spark.scheduler.ResultTask.run(ResultTask.scala:84) at spark.executor.Executor$TaskRunner.run(Executor.scala:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:151 13/05/31 
00:16:53 INFO cluster.TaskSetManager: Starting task 3.0:49 as TID 2162 on slave 12: ip-10-11-46-255.ec2.internal:38878 (NODE_LOCAL) 13/05/31 00:16:53 INFO cluster.TaskSetManager: Serialized task 3.0:49 as 2832 bytes in 0 ms 13/05/31 00:16:53 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_3,runningTasks:152 13/05/31 00:16:54 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_7_257 in mem {noformat} My Kryo Registrator looks like this: {code} class MyRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.register(classOf[Vector]) kryo.register(classOf[String]) kryo.register(classOf[Float]) kryo.register(classOf[Tuple3[String,Float,Vector]]) kryo.register(classOf[Seq[Tuple3[String,Float,Vector]]]) kryo.register(classOf[Map[String,Vector]]) } } {code} Vector in this case is an org.mlbase.Vector, which in this case is a slightly modified version of spark.util.Vector (uses floats instead of Doubles). -- This message
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212043#comment-14212043 ] Apache Spark commented on SPARK-2352: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3222 [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-664) Accumulator updates should get locally merged before sent to the driver
[ https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212046#comment-14212046 ] Andrew Ash commented on SPARK-664: -- [~irashid] it sounds like your proposal is to batch accumulator updates between tasks on the executor before sending them back to the driver? I agree this would reduce the amount of network traffic, but the batching would come at a cost of higher latency between task completion and accumulator update landing in the accumulator in the driver. With the completion of SPARK-2380 these accumulators are now shown in the UI, so increasing latency would have an effect on end users. If network bandwidth and UI update latency are fundamentally at odds, maybe this is a case for a user option to choose to optimize for network or UI, something like {{spark.accumulators.mergeUpdatesOnExecutor}} defaulted to false. cc [~pwendell] for thoughts Accumulator updates should get locally merged before sent to the driver --- Key: SPARK-664 URL: https://issues.apache.org/jira/browse/SPARK-664 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Priority: Minor Whenever a task finishes, the accumulator updates from that task are immediately sent back to the driver. When the accumulator updates are big, this is inefficient because (a) a lot more data has to be sent to the driver and (b) the driver has to do all the work of merging the updates together. Probably doesn't matter for small accumulators / low number of tasks, but if both are big, this could be a big bottleneck. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
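For reference, the update path under discussion is the one behind ordinary named accumulators. A minimal sketch follows (assuming a running cluster; the input path is a placeholder, and {{spark.accumulators.mergeUpdatesOnExecutor}} above is only a proposed name, not an existing setting):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-update-sketch"))

    // A named accumulator is shown in the web UI (SPARK-2380), so the latency
    // of each task's update reaching the driver is visible to end users.
    val emptyLines = sc.accumulator(0L, "empty lines")

    sc.textFile("hdfs:///placeholder/input").foreach { line =>
      // Today each task ships its local updates to the driver as soon as it
      // finishes; batching them on the executor would cut network traffic at
      // the cost of UI freshness.
      if (line.isEmpty) emptyLines += 1L
    }

    println(s"empty lines: ${emptyLines.value}")
    sc.stop()
  }
}
{code}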
[jira] [Commented] (SPARK-748) Add documentation page describing interoperability with other software (e.g. HBase, JDBC, Kafka, etc.)
[ https://issues.apache.org/jira/browse/SPARK-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212049#comment-14212049 ] Andrew Ash commented on SPARK-748: -- I agree this would be valuable -- almost like a Spark Cookbook of how to read and write data from various other systems. Step one is probably deciding what software to mention. Tentatively I propose: Spark Core - HDFS - HBase - Cassandra - Elasticsearch - JDBC, with examples for Postgres and MySQL - General Hadoop InputFormat Spark Streaming - Kafka - Flume - Storm For destination, this could go on the documentation included in the git repo and published to the Spark website, or on the Spark project wiki. I tend to prefer the former. A possible location for that could be http://spark.apache.org/docs/latest/programming-guide.html#external-datasets Add documentation page describing interoperability with other software (e.g. HBase, JDBC, Kafka, etc.) -- Key: SPARK-748 URL: https://issues.apache.org/jira/browse/SPARK-748 Project: Spark Issue Type: New Feature Components: Documentation Reporter: Josh Rosen Spark seems to be gaining a lot of data input / output features for integrating with systems like HBase, Kafka, JDBC, Hadoop, etc. It might be a good idea to create a single documentation page that provides a list of all of the data sources that Spark supports and links to the relevant documentation / examples / {{spark-users}} threads. This would help prospective users to evaluate how easy it will be to integrate Spark with their existing systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
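As a flavor of what one cookbook entry could look like, here is a hedged JDBC sketch built on the existing {{org.apache.spark.rdd.JdbcRDD}}; the connection URL, credentials, and query are placeholders that a real page would flesh out per database:
{code}
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcCookbookSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-cookbook-sketch"))

    // Placeholder connection details; a real entry would show Postgres and
    // MySQL variants as proposed above.
    val url = "jdbc:postgresql://dbhost:5432/mydb"

    // JdbcRDD splits the query on the two '?' bound parameters.
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 10,
      mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

    println(s"loaded ${rows.count()} rows over JDBC")
    sc.stop()
  }
}
{code}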
[jira] [Commented] (SPARK-625) Client hangs when connecting to standalone cluster using wrong address
[ https://issues.apache.org/jira/browse/SPARK-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212062#comment-14212062 ] Andrew Ash commented on SPARK-625: -- Spark is very sensitive to hostnames in Spark URLs, and that comes from Akka being very sensitive. I've personally been bitten by hostnames vs FQDNs vs external IP address vs loopback IP address, and it's really a pain. On current master branch (1.2) with the Spark standalone master listening on {{spark://aash-mbp.local:7077}} as confirmed by the master web UI, and the spark shell attempting to connect to {{spark://127.0.01:7077}} with the {{--master}} parameter, the driver tries 3 attempts and then fails with this message: {noformat} 14/11/14 01:37:56 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:37:56 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:37:56 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:16 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:38:16 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:16 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:38:36 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077... 14/11/14 01:38:36 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077 14/11/14 01:38:36 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077 14/11/14 01:38:56 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up. 14/11/14 01:38:56 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet. 14/11/14 01:38:56 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up. {noformat} So the hang seems to be gone and replaced with a reasonable 3x attempts and fail. [~joshrosen], short of changing Akka ourselves to make it less strict on exact URL matches, is there anything else we can do for this ticket? I think we can reasonably close as fixed. Client hangs when connecting to standalone cluster using wrong address -- Key: SPARK-625 URL: https://issues.apache.org/jira/browse/SPARK-625 Project: Spark Issue Type: Bug Affects Versions: 0.7.0, 0.7.1, 0.8.0 Reporter: Josh Rosen Priority: Minor I launched a standalone cluster on my laptop, connecting the workers to the master using my machine's public IP address (128.32.*.*:7077). 
If I try to connect spark-shell to the master using spark://0.0.0.0:7077, it successfully brings up a Scala prompt but hangs when I try to run a job. From the standalone master's log, it looks like the client's messages are being dropped without the client discovering that the connection has failed: {code} 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message RegisterJob(JobDescription(Spark shell)) for non-local recipient akka://spark@0.0.0.0:7077/user/Master at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message DaemonMsgWatch(Actor[akka://spark@128.32.*.*:57518/user/$a],Actor[akka://spark@0.0.0.0:7077/user/Master]) for non-local recipient akka://spark@0.0.0.0:7077/remote at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212064#comment-14212064 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, I sent an email to you about the shuffle data performance test. Looking forward to your reply. Thanks. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
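To make the zero-copy point concrete, here is a small standalone java.nio sketch (not Spark's actual network module) of {{FileChannel.transferTo}} pushing file bytes to a socket without passing them through user space:
{code}
import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

object ZeroCopySendSketch {
  def main(args: Array[String]): Unit = {
    val Array(path, host, port) = args  // e.g. a shuffle file and a peer host/port

    val fileChannel = new FileInputStream(path).getChannel
    val socket = SocketChannel.open(new InetSocketAddress(host, port.toInt))

    // transferTo lets the kernel move bytes from the page cache to the NIC,
    // avoiding the extra user-space copy described above. It may send fewer
    // bytes than requested, hence the loop.
    var position = 0L
    val size = fileChannel.size()
    while (position < size) {
      position += fileChannel.transferTo(position, size - position, socket)
    }

    fileChannel.close()
    socket.close()
  }
}
{code}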
[jira] [Commented] (SPARK-809) Give newly registered apps a set of executors right away
[ https://issues.apache.org/jira/browse/SPARK-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212068#comment-14212068 ] Andrew Ash commented on SPARK-809: -- I believe this situation hasn't changed? Looking a little at the master branch (1.2) the standalone cluster still gets its {{defaultParallelism}} in {{CoarseGrainedSchedulerBackend}} from the {{totalCoreCount}} that is incremented/decremented as executors join/leave the driver's awareness. Give newly registered apps a set of executors right away Key: SPARK-809 URL: https://issues.apache.org/jira/browse/SPARK-809 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Matei Zaharia Priority: Minor Right now, newly connected apps in the standalone cluster will not set a good defaultParallelism value if they create RDDs right after creating a SparkContext, because the executorAdded calls are asynchronous and happen after. It would be nice to wait for a few such calls before returning from the scheduler initializer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-835) RDD$parallelize() should use object serializer (not closure serializer) for collection objects
[ https://issues.apache.org/jira/browse/SPARK-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-835: - Fix Version/s: 0.8.0 RDD$parallelize() should use object serializer (not closure serializer) for collection objects -- Key: SPARK-835 URL: https://issues.apache.org/jira/browse/SPARK-835 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.7.3 Reporter: Dmitriy Lyubimov Fix For: 0.8.0 This is a twin issue for SPARK-826 encapsulating all use cases where collection of objects is transferred from front end to backend RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-835) RDD$parallelize() should use object serializer (not closure serializer) for collection objects
[ https://issues.apache.org/jira/browse/SPARK-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash resolved SPARK-835. -- Resolution: Fixed Closing per [~dlyubimov] with the Fix Version from SPARK-826 RDD$parallelize() should use object serializer (not closure serializer) for collection objects -- Key: SPARK-835 URL: https://issues.apache.org/jira/browse/SPARK-835 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.7.3 Reporter: Dmitriy Lyubimov Fix For: 0.8.0 This is a twin issue for SPARK-826 encapsulating all use cases where collection of objects is transferred from front end to backend RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-904) Not able to Start/Stop Spark Worker from Remote Machine
[ https://issues.apache.org/jira/browse/SPARK-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212078#comment-14212078 ] Andrew Ash commented on SPARK-904: -- [~ayushmishra2005] I suspect you don't have Spark installed on the remote machine -- the {{start-all.sh}} script won't install it for you on remote machines. If you're still having trouble, please reach out to the spark users list from http://spark.apache.org/community.html which is a better place for these kinds of requests anyway. I'm closing this issue for now but let me know here if you aren't able to get a resolution on the mailing lists. Thanks, and good luck with Spark! Andrew Not able to Start/Stop Spark Worker from Remote Machine --- Key: SPARK-904 URL: https://issues.apache.org/jira/browse/SPARK-904 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Ayush I have two machines A and B. I am trying to run Spark Master on machine A and Spark Worker on machine B. I have set machine B'host name in conf/slaves in my Spark directory. When I am executing start-all.sh to start master and workers, I am getting below message on console: abc@abc-vostro:~/spark-scala-2.10$ sudo sh bin/start-all.sh sudo: /etc/sudoers.d is world writable starting spark.deploy.master.Master, logging to /home/abc/spark-scala-2.10/bin/../logs/spark-root-spark.deploy.master.Master-1-abc-vostro.out 13/09/11 14:54:29 WARN spark.Utils: Your hostname, abc-vostro resolves to a loopback address: 127.0.1.1; using 1XY.1XY.Y.Y instead (on interface wlan2) 13/09/11 14:54:29 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address Master IP: abc-vostro cd /home/abc/spark-scala-2.10/bin/.. ; /home/abc/spark-scala-2.10/bin/start-slave.sh 1 spark://abc-vostro:7077 xyz@1XX.1XX.X.X's password: xyz@1XX.1XX.X.X: bash: line 0: cd: /home/abc/spark-scala-2.10/bin/..: No such file or directory xyz@1XX.1XX.X.X: bash: /home/abc/spark-scala-2.10/bin/start-slave.sh: No such file or directory Master is started but worker is failed to start. I have set xyz@1XX.1XX.X.X in conf/slaves in my Spark directory. Can anyone help me to resolve this? This is probably something I'm missing any configuration on my end. However When I create Spark Master and Worker on same machine, It is working fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-904) Not able to Start/Stop Spark Worker from Remote Machine
[ https://issues.apache.org/jira/browse/SPARK-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-904. Resolution: Not a Problem Not able to Start/Stop Spark Worker from Remote Machine --- Key: SPARK-904 URL: https://issues.apache.org/jira/browse/SPARK-904 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.3 Reporter: Ayush I have two machines A and B. I am trying to run Spark Master on machine A and Spark Worker on machine B. I have set machine B'host name in conf/slaves in my Spark directory. When I am executing start-all.sh to start master and workers, I am getting below message on console: abc@abc-vostro:~/spark-scala-2.10$ sudo sh bin/start-all.sh sudo: /etc/sudoers.d is world writable starting spark.deploy.master.Master, logging to /home/abc/spark-scala-2.10/bin/../logs/spark-root-spark.deploy.master.Master-1-abc-vostro.out 13/09/11 14:54:29 WARN spark.Utils: Your hostname, abc-vostro resolves to a loopback address: 127.0.1.1; using 1XY.1XY.Y.Y instead (on interface wlan2) 13/09/11 14:54:29 WARN spark.Utils: Set SPARK_LOCAL_IP if you need to bind to another address Master IP: abc-vostro cd /home/abc/spark-scala-2.10/bin/.. ; /home/abc/spark-scala-2.10/bin/start-slave.sh 1 spark://abc-vostro:7077 xyz@1XX.1XX.X.X's password: xyz@1XX.1XX.X.X: bash: line 0: cd: /home/abc/spark-scala-2.10/bin/..: No such file or directory xyz@1XX.1XX.X.X: bash: /home/abc/spark-scala-2.10/bin/start-slave.sh: No such file or directory Master is started but worker is failed to start. I have set xyz@1XX.1XX.X.X in conf/slaves in my Spark directory. Can anyone help me to resolve this? This is probably something I'm missing any configuration on my end. However When I create Spark Master and Worker on same machine, It is working fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-957) The problem that repeated computation among iterations
[ https://issues.apache.org/jira/browse/SPARK-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212084#comment-14212084 ] Andrew Ash commented on SPARK-957: -- Hi [~caizhua], are you still having issues with your implementation of the LDA algorithm? We try to only keep tickets open that have remaining work to be done. You can also reference SPARK-1405 for the LDA implementation being worked on for future inclusion into MLlib. The problem that repeated computation among iterations -- Key: SPARK-957 URL: https://issues.apache.org/jira/browse/SPARK-957 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 0.7.3 Reporter: caizhua For the LDA model, if we make each document a single record of the RDD, it is quite slow, so we instead make the RDD a set of blocks, where each block holds a subset of documents. However, when we run the program, we find that a lot of computation is repeated across iterations. Basically, when we come to the ith iteration, all the jobs that happened in iterations 0 to (i-1) are repeated. Likewise, the jobs in the ith iteration are repeated in the (i+1)th iteration, so with m iterations the jobs of each iteration end up repeated in every later one. However, the result is still correct. :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
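For anyone hitting the same behavior, the usual fix for iterative jobs is to persist each iteration's RDD so that iteration i+1 does not replay the lineage of iterations 0..i. A generic sketch follows (the update function and storage level are placeholders, not the reporter's actual LDA code):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IterativeCachingSketch {
  // Runs `step` numIterations times, materializing each result once and
  // releasing the previous iteration so earlier jobs are not recomputed.
  def iterate[T](initial: RDD[T], numIterations: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
    var current = initial.persist(StorageLevel.MEMORY_AND_DISK)
    current.count()  // force materialization of the starting point

    for (_ <- 0 until numIterations) {
      val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
      next.count()          // materialize this iteration exactly once
      current.unpersist()   // drop the previous iteration's blocks
      current = next
      // For very long lineages, next.checkpoint() would also truncate the
      // dependency graph so a failure does not recompute from iteration 0.
    }
    current
  }
}
{code}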
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212088#comment-14212088 ] Andrew Ash commented on SPARK-794: -- I don't see a {{ClusterScheduler}} class on master -- was that refactored to {{TaskSchedulerImpl}}? There is still a {{Thread.sleep(1000)}} in that class's {{stop()}} though, which may be what this ticket refers to: https://github.com/apache/spark/blob/1df05a40ebf3493b0aff46d18c0f30d2d5256c7b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L399 Remove sleep() in ClusterScheduler.stop --- Key: SPARK-794 URL: https://issues.apache.org/jira/browse/SPARK-794 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Matei Zaharia This temporary change made a while back slows down the unit tests quite a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212094#comment-14212094 ] Andrew Ash commented on SPARK-665: -- This role of creating RPM packages seems to have been taken up by the various Hadoop distributors for their distributions (Cloudera, MapR, HortonWorks, etc). Does the Apache Spark team still intend to create RPMs for Spark? An obvious subsequent request would be to also release in the DEB format. I don't think this is a route we want to go down now but wanted to hear others' thoughts. Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1206) Add python support for average and other summary statistics
[ https://issues.apache.org/jira/browse/SPARK-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1206. - Resolution: Implemented Closing per [~holdenk_amp] as already done. Add python support for average and other summary statistics --- Key: SPARK-1206 URL: https://issues.apache.org/jira/browse/SPARK-1206 Project: Spark Issue Type: Improvement Reporter: Holden Karau Priority: Minor We have a number of summary statistics in DoubleRDDFunctions.scala. We should implement these in python too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212104#comment-14212104 ] Andrew Ash commented on SPARK-665: -- Sean are you suggesting dropping the .deb packages that Apache Spark releases as a simplification effort? It feels inequitable to support one packaging format (deb) but not the other (rpm). Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-818) Design Spark Job Server
[ https://issues.apache.org/jira/browse/SPARK-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-818. Resolution: Won't Fix The community consensus was that the Spark job server should live in a separate repository from the Apache Spark code. It now lives in the repository that [~velvia] links above, and has continued to be quite active over the past 6 months. Closing this Apache Spark issue as Won't Fix -- the job server is a welcome addition to the growing Spark ecosystem but its issues don't belong in this issue tracker. Many thanks everyone! Design Spark Job Server --- Key: SPARK-818 URL: https://issues.apache.org/jira/browse/SPARK-818 Project: Spark Issue Type: New Feature Reporter: Evan Chan Attachments: job-server-architecture.md We've been developing a generic REST job server here at Ooyala and would like to outline its goals, architecture, and API, so we could get feedback from the community and hopefully contribute it back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212109#comment-14212109 ] Sean Owen commented on SPARK-665: - Not suggesting that, no. I suppose it does depend on demand. Apparently there was enough of a need for a .deb package that it was contributed and maintained, but perhaps not for .rpm, because there are other sources? If there were demand it could be worth the cost. Create RPM packages for Spark - Key: SPARK-665 URL: https://issues.apache.org/jira/browse/SPARK-665 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1231) DEAD worker should recover automatically
[ https://issues.apache.org/jira/browse/SPARK-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1231. - Resolution: Duplicate DEAD worker should recover automatically -- Key: SPARK-1231 URL: https://issues.apache.org/jira/browse/SPARK-1231 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Reporter: Yi Tian Labels: dead, recover, worker The master should send a response when a DEAD worker sends a heartbeat, so the worker can clean up all applications and drivers and re-register itself with the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1231) DEAD worker should recover automaticly
[ https://issues.apache.org/jira/browse/SPARK-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212110#comment-14212110 ] Andrew Ash commented on SPARK-1231: --- Sorry [~tianyi], when I did my search for prior tickets in SPARK-3736 I didn't find this one. The issue of workers not recovering when disconnected has since been resolved though and will be released as part of Spark 1.2.0, so I'm closing this ticket as a duplicate. Please let us know if you have any troubles with that implementation once you start testing it. Thanks for reporting the bug! Andrew DEAD worker should recover automatically -- Key: SPARK-1231 URL: https://issues.apache.org/jira/browse/SPARK-1231 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Reporter: Yi Tian Labels: dead, recover, worker The master should send a response when a DEAD worker sends a heartbeat, so the worker can clean up all applications and drivers and re-register itself with the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1169) Add countApproxDistinct and countApproxDistinctByKey to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212112#comment-14212112 ] Andrew Ash commented on SPARK-1169: --- On current master (1.2) I see that rdd.py now has a countApproxDistinct() method, but I don't see one for countApproxDistinctByKey() so this ticket is half completed. Add countApproxDistinct and countApproxDistinctByKey to PySpark --- Key: SPARK-1169 URL: https://issues.apache.org/jira/browse/SPARK-1169 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Matei Zaharia Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
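For reference, both methods already exist on the Scala side; a short sketch of what the missing PySpark half would mirror (the sample data is illustrative):
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object ApproxDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("approx-distinct-sketch"))

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("b", 1)))

    // Counterpart of the countApproxDistinct() already ported to rdd.py.
    val approxTotal = pairs.values.countApproxDistinct(relativeSD = 0.05)

    // The per-key variant from PairRDDFunctions is the half still missing
    // from PySpark.
    val approxPerKey = pairs.countApproxDistinctByKey(relativeSD = 0.05).collect()

    println(s"approx distinct values overall: $approxTotal")
    approxPerKey.foreach { case (k, n) => println(s"key $k has ~$n distinct values") }
    sc.stop()
  }
}
{code}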
[jira] [Commented] (SPARK-754) Multiple Spark Contexts active in a single Spark Context
[ https://issues.apache.org/jira/browse/SPARK-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212113#comment-14212113 ] Andrew Ash commented on SPARK-754: -- This is actually currently unsupported, and a ticket to make this possible is being tracked at SPARK-2243 I'm going to close this ticket as a duplicate of that one, but please let me know if you feel there are subtleties here that keep these from being a duplicate. Thanks for the bug report Erik! Andrew Multiple Spark Contexts active in a single Spark Context Key: SPARK-754 URL: https://issues.apache.org/jira/browse/SPARK-754 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0 Reporter: Erik James Freed Priority: Critical This may be no more than creating a unit test to ensure it can be done but it is not clear that one can instantiate multiple spark contexts within a single VM and use them concurrently (one thread in a context at a time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-754) Multiple Spark Contexts active in a single Spark Context
[ https://issues.apache.org/jira/browse/SPARK-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-754. Resolution: Duplicate Multiple Spark Contexts active in a single Spark Context Key: SPARK-754 URL: https://issues.apache.org/jira/browse/SPARK-754 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0 Reporter: Erik James Freed Priority: Critical This may be no more than creating a unit test to ensure it can be done but it is not clear that one can instantiate multiple spark contexts within a single VM and use them concurrently (one thread in a context at a time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2243: -- Affects Version/s: 0.7.0 Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: 
http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Created] (SPARK-4399) Support multiple cloud providers
Andrew Ash created SPARK-4399: - Summary: Support multiple cloud providers Key: SPARK-4399 URL: https://issues.apache.org/jira/browse/SPARK-4399 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.0 Reporter: Andrew Ash We currently have Spark startup scripts for Amazon EC2 but not for various other cloud providers. This ticket is an umbrella to support multiple cloud providers in the bundled scripts, not just Amazon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4400) Add scripts for launching Spark on Google Compute Engine (GCE)
Andrew Ash created SPARK-4400: - Summary: Add scripts for launching Spark on Google Compute Engine (GCE) Key: SPARK-4400 URL: https://issues.apache.org/jira/browse/SPARK-4400 Project: Spark Issue Type: New Feature Reporter: Andrew Ash -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2398. -- Resolution: Duplicate The PR that resolved that was ultimately tied to a new JIRA, SPARK-3768. Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.master spark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction 0.6 spark.shuffle.memoryFraction 0.3 spark.shuffle.compress true spark.shuffle.spill.compress true spark.broadcast.compress true spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores 16 spark.locality.wait 6000 spark.executor.instances 5 UI shows ExecutorLostFailure.
Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) --- 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to
[jira] [Commented] (SPARK-1358) Continuous integrated test should be involved in Spark ecosystem
[ https://issues.apache.org/jira/browse/SPARK-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212125#comment-14212125 ] Andrew Ash commented on SPARK-1358: --- I've heard these sorts of extended tests called end-to-end tests. A sample one for Spark could be to stand up an HDFS cluster, load some data into it, stand up a parallel Spark cluster, read data out of HDFS, and compute some kind of aggregate. They tend to require significant hardware to run, though, because they are much more intensive and can be long-running. [~pwendell] is that kind of extended hardware available? I know we currently run some CI on Amplab Jenkins but I'm uncertain how much additional capacity it could support. cc [~shaneknapp] Continuous integrated test should be involved in Spark ecosystem - Key: SPARK-1358 URL: https://issues.apache.org/jira/browse/SPARK-1358 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: xiajunluan Currently, Spark only contains unit tests and performance tests, but that is not enough for customers to evaluate the status of their cluster and the Spark version they will use. It is necessary to build continuous integration tests for Spark development. They could include: 1. complex application test cases for Spark/Spark Streaming/GraphX 2. stress test cases 3. fault tolerance test cases 4. ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
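As a rough illustration of the end-to-end test described in the comment above, here is a minimal Scala sketch. The master URL, HDFS path, dataset, and expected sum are all placeholders invented for illustration; this is not an existing Spark test.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal end-to-end check: read a known dataset out of HDFS, compute an
// aggregate, and fail loudly if the result does not match the expected value.
object EndToEndSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EndToEndSmokeTest")
      .setMaster("spark://localhost:7077")           // assumed standalone master
    val sc = new SparkContext(conf)

    // Assumed location of a file previously loaded into the test HDFS cluster,
    // containing one integer per line.
    val data = sc.textFile("hdfs://localhost:8020/e2e/input/numbers.txt")
    val sum = data.map(_.trim.toLong).reduce(_ + _)

    val expected = 5000050000L                        // made-up expected aggregate
    require(sum == expected, s"End-to-end check failed: got $sum, expected $expected")

    sc.stop()
  }
}
{code}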
[jira] [Closed] (SPARK-4400) Add scripts for launching Spark on Google Compute Engine (GCE)
[ https://issues.apache.org/jira/browse/SPARK-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-4400. - Resolution: Duplicate Add scripts for launching Spark on Google Compute Engine (GCE) -- Key: SPARK-4400 URL: https://issues.apache.org/jira/browse/SPARK-4400 Project: Spark Issue Type: New Feature Reporter: Andrew Ash -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212132#comment-14212132 ] Andrew Ash commented on SPARK-928: -- Latest Chill (0.5.0) is still using Kryo 2.21 so this is still waiting on a Chill update https://github.com/twitter/chill/blob/0.5.0/project/Build.scala#L13 Add support for Unsafe-based serializer in Kryo 2.22 Key: SPARK-928 URL: https://issues.apache.org/jira/browse/SPARK-928 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Fix For: 1.0.0 This can reportedly be quite a bit faster, but it also requires Chill to update its Kryo dependency. Once that happens we should add a spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
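For concreteness, here is a hedged sketch of what a {{spark.kryo.useUnsafe}} flag might look like once Chill picks up Kryo 2.22. The flag name comes from the issue description above, but the wiring shown (reading the flag from SparkConf and choosing Kryo's Unsafe-based streams) is only an assumption, not the actual Spark implementation.

{code}
import com.esotericsoftware.kryo.io.{Input, Output, UnsafeInput, UnsafeOutput}
import java.io.{InputStream, OutputStream}
import org.apache.spark.SparkConf

// Hypothetical wiring: pick Unsafe-based Kryo streams when the proposed
// spark.kryo.useUnsafe flag is set (requires Kryo >= 2.22 via Chill).
class KryoStreamFactory(conf: SparkConf) {
  private val useUnsafe = conf.getBoolean("spark.kryo.useUnsafe", false)

  def newOutput(stream: OutputStream): Output =
    if (useUnsafe) new UnsafeOutput(stream) else new Output(stream)

  def newInput(stream: InputStream): Input =
    if (useUnsafe) new UnsafeInput(stream) else new Input(stream)
}
{code}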
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212145#comment-14212145 ] Andrew Ash commented on SPARK-1568: --- [~sams] did you see an improvement when you upgraded to 1.0.0? I've noticed myself that reading from s3 can be slow, particularly in the scenario where there are many small files. I think the hadoop S3 adapter makes many more API calls than is necessary, and that scales with the number of files you have. Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
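To make the small-files point concrete, here is a hedged sketch of the usual workaround: read the many small S3 objects with a glob and immediately coalesce to a smaller number of partitions so downstream stages are not dominated by per-file overhead. The bucket name, path, and partition count are placeholders.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: reading many small S3 objects, then coalescing so the rest of the
// job does not carry one tiny partition per file.
val sc = new SparkContext(new SparkConf().setAppName("S3SmallFiles"))
val raw = sc.textFile("s3n://my-bucket/logs/2014/11/*")   // placeholder bucket/path
val compacted = raw.coalesce(64)                          // placeholder partition count
println(s"line count: ${compacted.count()}")
{code}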
[jira] [Created] (SPARK-4401) RuleExecutor correctly logs trace iteration num
YanTang Zhai created SPARK-4401: --- Summary: RuleExecutor correctly logs trace iteration num Key: SPARK-4401 URL: https://issues.apache.org/jira/browse/SPARK-4401 Project: Spark Issue Type: Bug Components: SQL Reporter: YanTang Zhai RuleExecutor logs the wrong iteration number at trace level, as follows: if (curPlan.fastEquals(lastPlan)) { logTrace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.") continue = false } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
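To show why the reported number is off, here is a self-contained, simplified sketch of a fixed-point loop of this shape (not the actual RuleExecutor code): the counter is incremented after each pass, so by the time the plan is found unchanged, {{iteration}} is one greater than the number of passes that have run, and the message should log {{iteration - 1}}.

{code}
// Self-contained illustration of the off-by-one: the counter is bumped after a
// pass, so when the fixed point is detected, `iteration` is one too high.
object FixedPointDemo extends App {
  def applyAllRules(plan: Int): Int = math.min(plan + 1, 3) // toy "rule batch": converges at 3

  var curPlan = 0
  var lastPlan = -1
  var iteration = 1
  var continue = true
  while (continue) {
    curPlan = applyAllRules(curPlan)
    iteration += 1                    // now points at the *next* pass
    if (curPlan == lastPlan) {
      // Correct count is (iteration - 1); logging `iteration` over-reports by one.
      println(s"Fixed point reached after ${iteration - 1} iterations.")
      continue = false
    }
    lastPlan = curPlan
  }
}
{code}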
[jira] [Created] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
Vijay created SPARK-4402: Summary: Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception; all the processing completed up to that point is lost, which results in resource wastage (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, the runtime exception could be avoided. I believe a similar validation/feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path only at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
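The request is essentially to fail fast on an existing output path before any work is scheduled. Below is a hedged sketch of such a pre-flight check using the Hadoop FileSystem API; the helper name and the way it is called before the job are illustrative only, not the validation Spark (or SPARK-1100) actually implements.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative pre-flight check: verify the output path is free *before*
// triggering the action, instead of failing after the whole job has run.
object OutputPathCheck {
  def requireAbsent(pathStr: String, hadoopConf: Configuration = new Configuration()): Unit = {
    val path = new Path(pathStr)
    val fs = FileSystem.get(path.toUri, hadoopConf)
    require(!fs.exists(path), s"Output path $path already exists; refusing to start the job.")
  }
}

// Usage (hypothetical): OutputPathCheck.requireAbsent("hdfs:///tmp/out")
//                       schemaRdd.saveAsTextFile("hdfs:///tmp/out")
{code}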
[jira] [Resolved] (SPARK-3722) Spark on yarn docs work
[ https://issues.apache.org/jira/browse/SPARK-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3722. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: WangTaoTheTonic Target Version/s: 1.3.0 Spark on yarn docs work --- Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 Adds another way to obtain containers' logs. Fixes an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4403) Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed.
[ https://issues.apache.org/jira/browse/SPARK-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pahomov updated SPARK-4403: Attachment: ipython_out Elastic allocation (spark.dynamicAllocation.enabled) results in task never being executed. Key: SPARK-4403 URL: https://issues.apache.org/jira/browse/SPARK-4403 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.1 Reporter: Egor Pahomov Attachments: ipython_out I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. The task never ends. Code: {code} import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) print "Pi is roughly %f" % (4.0 * count / n) {code} {code} pyspark \ --verbose \ --master yarn-client \ --conf spark.driver.port=$((RANDOM_PORT + 2)) \ --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \ --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \ --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \ --conf spark.executor.port=$((RANDOM_PORT + 6)) \ --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.dynamicAllocation.minExecutors=1 \ --conf spark.dynamicAllocation.maxExecutors=10 \ --conf spark.ui.port=$SPARK_UI_PORT {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1568. -- Resolution: Fixed Fix Version/s: 1.0.0 Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam Fix For: 1.0.0 I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212376#comment-14212376 ] Sean Owen commented on SPARK-4402: -- Is this not the same issue resolved by https://issues.apache.org/jira/browse/SPARK-1100 ? I think the behavior implemented there is the intended behavior here. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation is happening at the time of statement execution as a part of lazyevolution of action statement. But if the path already exists then it throws a runtime exception. Hence all the processing completed till that point is lost which results in resource wastage (processing time and CPU usage). If this I/O related validation is done before the RDD action operations then this runtime exception can be avoided. I believe similar validation/ feature is implemented in hadoop also. Example: SchemaRDD.saveAsTextFile() evaluated the path during runtime -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212412#comment-14212412 ] Apache Spark commented on SPARK-4404: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3266 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-4404: --- Description: When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. was: We we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch driver. If we get process id of SparkSubmitDriverBootstrapper and wanna kill it during its running, we expect its SparkSubmit sub-process stop also. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
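One common way to get the behavior described here is for the wrapper process to register a JVM shutdown hook that destroys its child when the wrapper itself is killed. The sketch below shows that general pattern only; it is not the change made in the linked pull request, and the child command line is a placeholder.

{code}
import scala.sys.process._

// General pattern: a launcher that forwards termination to its sub-process.
object LauncherWithShutdownHook extends App {
  // Placeholder child command; the real bootstrapper launches SparkSubmit.
  val child: Process = Seq("sleep", "600").run()

  // If this launcher is killed (e.g. SIGTERM), take the child down with us.
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = child.destroy()
  })

  // Otherwise just mirror the child's exit code.
  sys.exit(child.exitValue())
}
{code}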
[jira] [Commented] (SPARK-4354) 14/11/12 09:39:00 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, HYD-RNDNW-VFRCO-RCORE2): java.lang.NoClassDefFoundError: Could not initialize class org.xerial
[ https://issues.apache.org/jira/browse/SPARK-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212482#comment-14212482 ] Apache Spark commented on SPARK-4354: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3211 14/11/12 09:39:00 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, HYD-RNDNW-VFRCO-RCORE2): java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy -- Key: SPARK-4354 URL: https://issues.apache.org/jira/browse/SPARK-4354 Project: Spark Issue Type: Question Components: Examples Affects Versions: 1.1.0 Environment: Linux Reporter: Shyam Labels: newbie Attachments: client-exception.txt Prebuilt Spark for Hadoop 2.4 installed in 4 redhat linux machines Standalone cluster mode. Machine 1(Master) Machine 2(Worker node 1) Machine 3(Worker node 2) Machine 4(Client for executing spark examples) I ran below mentioned command in Machine 4 then got exception mentioned in the summary of this issue. sh spark-submit --class org.apache.spark.examples.SparkPi --jars /FS/lib/spark-assembly-1.1.0-hadoop2.4.0.jar --master spark://MasterIP:7077 --deploy-mode client /FS/lib/spark-examples-1.1.0-hadoop2.4.0.jar 10 java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4338) Remove yarn-alpha support
[ https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212485#comment-14212485 ] Andrew Ash commented on SPARK-4338: --- From discussion on the dev list today, Sandy aims to include an example of building against Hadoop 2.5 in the docs with this PR Remove yarn-alpha support - Key: SPARK-4338 URL: https://issues.apache.org/jira/browse/SPARK-4338 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4405) Matrices.* construction methods should check for rows x cols overflow
Joseph K. Bradley created SPARK-4405: Summary: Matrices.* construction methods should check for rows x cols overflow Key: SPARK-4405 URL: https://issues.apache.org/jira/browse/SPARK-4405 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Matrices has several methods which construct new all-0, all-1, or random matrices. They take numRows, numCols as Int and multiply them to get the matrix size. They should check for overflow before trying to create the matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
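For reference, a hedged sketch of the kind of guard being asked for: check that numRows * numCols fits in an Int before allocating the backing array. The helper below is illustrative and not the code that went into MLlib.

{code}
// Illustrative overflow guard for a rows-by-cols allocation.
def checkedSize(numRows: Int, numCols: Int): Int = {
  require(numRows >= 0 && numCols >= 0, "numRows and numCols must be non-negative")
  val size: Long = numRows.toLong * numCols.toLong
  require(size <= Int.MaxValue,
    s"$numRows x $numCols = $size elements overflows the maximum array size (${Int.MaxValue})")
  size.toInt
}

// e.g. new Array[Double](checkedSize(numRows, numCols)) instead of new Array[Double](numRows * numCols)
{code}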
[jira] [Created] (SPARK-4406) SVD should check for k < 1
Joseph K. Bradley created SPARK-4406: Summary: SVD should check for k < 1 Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
Cheng Lian created SPARK-4407: - Summary: Thrift server for 0.13.1 doesn't deserialize complex types properly Key: SPARK-4407 URL: https://issues.apache.org/jira/browse/SPARK-4407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.2 Reporter: Cheng Lian Priority: Blocker The following snippet can reproduce this issue: {code} CREATE TABLE t0(m MAP<INT, STRING>); INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; SELECT * FROM t0; {code} Exception thrown: {code} java.lang.RuntimeException: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) ... 19 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4408) Behavior difference between spark-submit conf vs cmd line args
Pedro Rodriguez created SPARK-4408: -- Summary: Behavior difference between spark-submit conf vs cmd line args Key: SPARK-4408 URL: https://issues.apache.org/jira/browse/SPARK-4408 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez Priority: Minor There seems to be a difference between the behavior of bin/spark-submit with using command line arguments vs configuration file. It looks like either a bug or at least where documentation could be clearer about this difference. Steps to Replicate: 1. Submit a job with a command similar to: bin/spark-submit --class nipslda.NipsLda --master local --conf spark.executor.memory=2g --conf spark.driver.memory=2g --verbose ~/Code/nips-lda/target/scala-2.10/nips-lda-assembly-0.1.jar 2. Navigate to SparkUI. 3. Environment tab lists driver and executor memory correctly. 4. Executor tab shows memory as ~260MB (my case) or default JVM limit 5. Write memory arguments to conf/spark-defaults.conf 6. Run same command without memory arguments 7. SparkUI executor tab correctly shows memory setting Looking at spark-submit, it includes this passage in the comments: # For client mode, the driver will be launched in the same JVM that launches # SparkSubmit, so we may need to read the properties file for any extra class # paths, library paths, java options and memory early on. Otherwise, it will # be too late by the time the driver JVM has started. Based on this, it seems that JVM parameters for spark-submit are used for the job itself when in client mode. Effectively, it makes it impossible to use the command line argument setting method to change JVM parameters since the JVM is already launched. This seems like unexpected/undesirable behavior which could be fixed or docs could be changed to better reflect how this works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
[ https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212544#comment-14212544 ] Apache Spark commented on SPARK-4407: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3178 Thrift server for 0.13.1 doesn't deserialize complex types properly --- Key: SPARK-4407 URL: https://issues.apache.org/jira/browse/SPARK-4407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.2 Reporter: Cheng Lian Priority: Blocker The following snippet can reproduce this issue: {code} CREATE TABLE t0(m MAPINT, STRING); INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; SELECT * FROM t0; {code} Exception throw: {code} java.lang.RuntimeException: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) ... 
19 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-603) add simple Counter API
[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-603. --- Resolution: Won't Fix Closing this one as part of [~aash]'s cleanup. I think this problem is being fixed as we add accumulator / metrics values to the web ui. add simple Counter API -- Key: SPARK-603 URL: https://issues.apache.org/jira/browse/SPARK-603 Project: Spark Issue Type: New Feature Priority: Minor Users need a very simple way to create counters in their jobs. Accumulators provide a way to do this, but are a little clunky, for two reasons: 1) the setup is a nuisance 2) w/ delayed evaluation, you don't know when it will actually run, so it's hard to look at the values. Consider this code: {code} def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = { val filterCount = sc.accumulator(0) val filtered = rdd.filter { r => if (isOK(r)) true else { filterCount += 1; false } } println("removed " + filterCount.value + " records") filtered } {code} The println will always say 0 records were filtered, because it's printed before anything has actually run. I could print out the value later on, but note that it would destroy the modularity of the method -- kinda ugly to return the accumulator just so that it can get printed later on. (and of course, the caller in turn might not know when the filter is going to get applied, and would have to pass the accumulator up even further ...) I'd like to have Counters which just automatically get printed out whenever a stage has been run, and also with some api to get them back. I realize this is tricky b/c a stage can get re-computed, so maybe you should only increment the counters once. Maybe a more general way to do this is to provide some callback for whenever an RDD is computed -- by default, you would just print the counters, but the user could replace w/ a custom handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
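Reynold's point about accumulator values in the web UI refers to named accumulators. As a hedged illustration of how the original example can be handled with them: give the accumulator a name (so it shows up per stage in the UI) and only read its value after an action has forced evaluation. The isOK predicate from the original is replaced with a trivial stand-in, and the exact UI rendering may differ; the snippet just shows the pattern.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pattern: name the accumulator (visible in the web UI per stage) and read it
// only after an action has materialized the RDD.
def filterBogus(rdd: RDD[String], sc: SparkContext): RDD[String] = {
  val filterCount = sc.accumulator(0, "bogusRecordsFiltered")
  val filtered = rdd.filter { r =>
    val ok = r.nonEmpty            // stand-in for the original isOK(r)
    if (!ok) filterCount += 1
    ok
  }
  filtered.cache()
  filtered.count()                 // force evaluation so the counter is populated
  println("removed " + filterCount.value + " records")
  filtered
}
{code}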
[jira] [Created] (SPARK-4410) Support for external sort
Michael Armbrust created SPARK-4410: --- Summary: Support for external sort Key: SPARK-4410 URL: https://issues.apache.org/jira/browse/SPARK-4410 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical When any given key has too high a cardinality, the current sorting code can tip over (since it loads a whole partition into memory). I propose we add optional support for using Spark's built-in external sorting mechanism. It can be off by default, but if we determine this code path does not regress performance we can turn it on by default in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4409) Additional (but limited) Linear Algebra Utils
Burak Yavuz created SPARK-4409: -- Summary: Additional (but limited) Linear Algebra Utils Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4395: Description: When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat {code} import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error(Invalid logline: %s % logline) ... return Row( ... UserID= int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating= int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile(file:///home/ec2-user/movielens/ratings.dat) ...# Call the parse_apace_log_line function on each line. ....map(parse_ratings_line) ...# Caches the objects in memory since they will be queried multiple times. ....cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable(RatingsTable) sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() {code} (Now the Python shell hangs...) was: When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat +++ Code ran +++ import re import string from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' def parse_ratings_line(line): ... match = re.search(RATINGS_PATTERN, line) ... if match is None: ... # Optionally, you can change this to just ignore if each line of data is not critical. ... raise Error(Invalid logline: %s % logline) ... return Row( ... UserID= int(match.group(1)), ... MovieID = int(match.group(2)), ... Rating= int(match.group(3)), ... Timestamp = int(match.group(4))) ... ratings_base_RDD = (sc.textFile(file:///home/ec2-user/movielens/ratings.dat) ...# Call the parse_apace_log_line function on each line. ....map(parse_ratings_line) ...# Caches the objects in memory since they will be queried multiple times. 
....cache()) ratings_base_RDD.count() 1000209 ratings_base_RDD.first() Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) schemaRatings = sqlContext.inferSchema(ratings_base_RDD) schemaRatings.registerTempTable(RatingsTable) sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() (Now the Python shell hangs...) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql(SELECT * FROM RatingsTable limit 5).collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update git clone https://github.com/apache/spark sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat {code}
[jira] [Resolved] (SPARK-4394) Allow datasources to support IN and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4394. Resolution: Fixed Fix Version/s: 1.2.0 Allow datasources to support IN and sizeInBytes --- Key: SPARK-4394 URL: https://issues.apache.org/jira/browse/SPARK-4394 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 These are both pretty critical for running benchmarks against externally defined datasources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-4409: --- Description: This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. Additional (but limited) Linear Algebra Utils - Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. 
gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a
[jira] [Created] (SPARK-4411) Add kill link for jobs in the UI
Kay Ousterhout created SPARK-4411: - Summary: Add kill link for jobs in the UI Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-4411: -- Issue Type: New Feature (was: Bug) Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4413) Parquet support through datasource API
Michael Armbrust created SPARK-4413: --- Summary: Parquet support through datasource API Key: SPARK-4413 URL: https://issues.apache.org/jira/browse/SPARK-4413 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Right now there are several issues with our parquet support. Specifically, the only way to access parquet files through pure SQL is by including Hive, which has the following issues: - fairly verbose syntax - requires you to explicitly add partitions - does not support decimal types - querying tables with many partitions results in metadata operations dominating the query time (even worse when reading from S3). It would be great to have better native support here through the new datasources API. Ideally once that is in place we can deprecate the existing ParquetRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4398) Specialize rdd.parallelize for xrange
[ https://issues.apache.org/jira/browse/SPARK-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4398. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3264 [https://github.com/apache/spark/pull/3264] Specialize rdd.parallelize for xrange - Key: SPARK-4398 URL: https://issues.apache.org/jira/browse/SPARK-4398 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 sc.parallelize(range) is slow, which writes to disk. We should specialize xrange for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4413) Parquet support through datasource API
[ https://issues.apache.org/jira/browse/SPARK-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212805#comment-14212805 ] Apache Spark commented on SPARK-4413: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3269 Parquet support through datasource API -- Key: SPARK-4413 URL: https://issues.apache.org/jira/browse/SPARK-4413 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Right now there are several issues with our parquet support. Specifically, the only way to access parquet files through pure SQL is by including Hive, which has the following issues: - fairly verbose syntax - requires you to explicitly add partitions - does not support decimal types - querying tables with many partitions results in metadata operations dominating the query time (even worse when reading from S3). It would be great to have better native support here through the new datasources API. Ideally once that is in place we can deprecate the existing ParquetRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212815#comment-14212815 ] Xiangrui Meng edited comment on SPARK-3080 at 11/14/14 8:56 PM: Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue, which violates the immutability assumption of RDD. was (Author: mengxr): Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
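The "immutability assumption" mentioned here is that the ratings RDD must describe the same records (and therefore the same user and product IDs) every time a partition is recomputed. Below is a hedged sketch of the kind of anti-pattern that can break ALS this way: building the input from a non-deterministic transformation, so that recomputation after a failure yields IDs the previously built block indices have never seen. This is only an illustration of the failure mode being discussed, not a reproduction of the reporter's code.

{code}
import scala.util.Random
import org.apache.spark.SparkContext

// Anti-pattern sketch: an RDD whose contents differ every time it is evaluated.
// If such an RDD feeds ALS and a partition is recomputed mid-job, the new user /
// product IDs may not exist in the block indices built from the first evaluation,
// which can surface as an ArrayIndexOutOfBoundsException deep inside ALS.
def unstableRatings(sc: SparkContext) =
  sc.parallelize(1 to 1000000).flatMap { i =>
    // No fixed seed: each evaluation keeps a *different* subset of records.
    if (Random.nextDouble() < 0.5) Some((Random.nextInt(10000), Random.nextInt(10000), 1.0))
    else None
  }

// Safer alternatives: sample with a fixed seed, or persist / checkpoint the
// sampled RDD before handing it to ALS.
{code}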
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212815#comment-14212815 ] Xiangrui Meng commented on SPARK-3080: -- Thanks for the confirmation! If [~ilganeli] also confirms that this is the root cause, I'm going to close this JIRA as not a problem because this is the only way I could reproduce the issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Assignee: Xiangrui Meng ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz Assignee: Xiangrui Meng The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212822#comment-14212822 ] Ilya Ganelin commented on SPARK-3080: - Hi Xiangrui - I was not doing any sort of randomization or sampling in the code that produced this issue. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz Assignee: Xiangrui Meng The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
Pedro Rodriguez created SPARK-4414: -- Summary: SparkContext.wholeTextFiles Doesn't work with S3 Buckets Key: SPARK-4414 URL: https://issues.apache.org/jira/browse/SPARK-4414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case, which follows the same pattern, is in the git repo linked below. Steps to reproduce: 1. Create an Amazon S3 bucket with multiple files and make it public 2. Attempt to read the bucket with sc.wholeTextFiles(s3n://mybucket/myfile.txt) 3. Spark returns the following error, even though the file exists. Exception in thread main java.io.FileNotFoundException: File does not exist: /myfile.txt at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.init(CombineFileInputFormat.java:489) 4. Change the call to sc.textFile(s3n://mybucket/myfile.txt) and there is no error message; the application runs fine. There is a question on StackOverflow about this as well: http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist This is a link to the repo/lines of code. The uncommented call doesn't work, the commented call works as expected: https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 Falling back to textFile with a multi-file argument would be easy, but wholeTextFiles should work correctly for S3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
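A minimal Scala repro sketch of the behaviour described above (the bucket and key names are placeholders, and the comment on the cause is an inference from the stack trace, not a confirmed diagnosis):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wholeTextFiles-s3-repro"))

// Works: textFile resolves the s3n:// path and counts lines.
val lines = sc.textFile("s3n://mybucket/myfile.txt").count()

// Fails with java.io.FileNotFoundException: File does not exist: /myfile.txt --
// apparently because CombineFileInputFormat resolves the path against the
// default (HDFS) filesystem instead of the S3 filesystem.
val files = sc.wholeTextFiles("s3n://mybucket/myfile.txt").count()
{code}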
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212833#comment-14212833 ] sam commented on SPARK-1867: [~ansonism] Are you 100% sure your jar is also the same hadoop dep version?? Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() 
.setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
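For readability, here is the reporter's driver snippet again with line breaks and the string quoting that was lost in formatting restored; clusterMaster, appName, sparkHome and someHdfsPath are values defined elsewhere in the reporter's code:
{code}
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))

println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
{code}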
[jira] [Updated] (SPARK-3860) Improve dimension joins
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3860: Priority: Critical (was: Major) Improve dimension joins --- Key: SPARK-3860 URL: https://issues.apache.org/jira/browse/SPARK-3860 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Critical This is an umbrella ticket for improving performance for joining multiple dimension tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212852#comment-14212852 ] Anson Abraham commented on SPARK-1867: -- Yes. i added 3 data nodes just for this. And I had 2 of them as my worker node and the other as my master. Still getting that issue. Also the jar files were all supplied by Cloudera. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4380) Executor full of log spilling in-memory map of 0 MB to disk
[ https://issues.apache.org/jira/browse/SPARK-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4380: - Assignee: Hong Shen Executor full of log spilling in-memory map of 0 MB to disk - Key: SPARK-4380 URL: https://issues.apache.org/jira/browse/SPARK-4380 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen Assignee: Hong Shen When I set spark.shuffle.manager = sort, the executor log fills up with messages like the following, which is quite confusing: 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (277 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (278 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (279 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (280 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (281 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (282 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (283 spills so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4380) Executor full of log spilling in-memory map of 0 MB to disk
[ https://issues.apache.org/jira/browse/SPARK-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212857#comment-14212857 ] Andrew Or commented on SPARK-4380: -- In general it's pretty worrying that it's spilling so much. Nevertheless this is a good change. We should look into why it's spilling all the time separately. Executor full of log spilling in-memory map of 0 MB to disk - Key: SPARK-4380 URL: https://issues.apache.org/jira/browse/SPARK-4380 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen Assignee: Hong Shen When I set spark.shuffle.manager = sort, the executor log fills up with messages like the following, which is quite confusing: 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (277 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (278 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (279 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (280 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (281 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (282 spills so far) 14/11/13 17:35:18 INFO collection.ExternalSorter: Thread 111 spilling in-memory batch of 0 MB to disk (283 spills so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4313) Thread Dump link is broken in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4313. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Thread Dump link is broken in yarn-cluster mode - Key: SPARK-4313 URL: https://issues.apache.org/jira/browse/SPARK-4313 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.2.0 In yarn-cluster mode, the Web UI is running behind a yarn proxy server. Some features(or bugs?) of yarn proxy server will break the links for thread dump. 1. Yarn proxy server will do http redirect internally, so if opening http://example.com:8088/cluster/app/application_1415344371838_0012/executors;, it will fetch http://example.com:8088/cluster/app/application_1415344371838_0012/executors/; and return the content but won't change the link in the browser. Then when a user clicks Thread Dump, it will jump to http://example.com:8088/proxy/application_1415344371838_0012/threadDump/?executorId=2;. This is a wrong link. The correct link should be http://example.com:8088/proxy/application_1415344371838_0012/executors/threadDump/?executorId=2;. 2. Yarn proxy server has a bug about the URL encode/decode. When a user accesses http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E;, the yarn proxy server will require http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E;. But Spark web server expects http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E;. I will report this issue to Hadoop community later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4313) Thread Dump link is broken in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4313: - Assignee: Shixiong Zhu Thread Dump link is broken in yarn-cluster mode - Key: SPARK-4313 URL: https://issues.apache.org/jira/browse/SPARK-4313 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.2.0 In yarn-cluster mode, the Web UI is running behind a yarn proxy server. Some features(or bugs?) of yarn proxy server will break the links for thread dump. 1. Yarn proxy server will do http redirect internally, so if opening http://example.com:8088/cluster/app/application_1415344371838_0012/executors;, it will fetch http://example.com:8088/cluster/app/application_1415344371838_0012/executors/; and return the content but won't change the link in the browser. Then when a user clicks Thread Dump, it will jump to http://example.com:8088/proxy/application_1415344371838_0012/threadDump/?executorId=2;. This is a wrong link. The correct link should be http://example.com:8088/proxy/application_1415344371838_0012/executors/threadDump/?executorId=2;. 2. Yarn proxy server has a bug about the URL encode/decode. When a user accesses http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E;, the yarn proxy server will require http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E;. But Spark web server expects http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E;. I will report this issue to Hadoop community later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4404: - Affects Version/s: 1.1.0 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: WangTaoTheTonic When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver. If we take the process id of SparkSubmitDriverBootstrapper and kill it while it is running, we expect its SparkSubmit sub-process to stop as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
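This kind of behaviour is usually implemented with a JVM shutdown hook in the bootstrapper that destroys the child process. The snippet below is a hedged sketch of that pattern (command is assumed to be the SparkSubmit command line built elsewhere); it is not the patch that was actually merged:
{code}
// In SparkSubmitDriverBootstrapper, after building `command: Seq[String]`:
val process = new ProcessBuilder(command: _*).start()

// If the bootstrapper JVM is killed (e.g. via SIGTERM), take the child down too.
Runtime.getRuntime.addShutdownHook(new Thread("destroy-sparksubmit-subprocess") {
  override def run(): Unit = process.destroy()
})

// Propagate the child's exit code once it finishes normally.
System.exit(process.waitFor())
{code}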
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212875#comment-14212875 ] Ilya Ganelin commented on SPARK-3694: - There is also task serialization that happens within the DAG Scheduler. Do we want to print that? Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Ilya Ganelin Labels: starter This would be useful for debugging extra references inside of RDD's Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
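For illustration only, a reflection-based object-graph walk along the lines of the linked ObjectGraphWalker might look like the sketch below. This is not Spark code; a real implementation would need identity-based visited tracking, array-element handling, and field filtering:
{code}
import java.lang.reflect.{Field, Modifier}
import scala.collection.mutable

// Print every reachable reference field of `root`, indented by depth.
// Sketch only: uses equals/hashCode for cycle detection and ignores array elements.
def printObjectGraph(root: AnyRef, maxDepth: Int = 5): Unit = {
  val visited = mutable.Set.empty[AnyRef]

  def allFields(cls: Class[_]): Seq[Field] =
    if (cls == null) Seq.empty
    else cls.getDeclaredFields.toSeq.filterNot(f => Modifier.isStatic(f.getModifiers)) ++
      allFields(cls.getSuperclass)

  def walk(obj: AnyRef, depth: Int): Unit = {
    if (obj == null || depth > maxDepth || !visited.add(obj)) return
    println(("  " * depth) + obj.getClass.getName)
    for (f <- allFields(obj.getClass) if !f.getType.isPrimitive) {
      f.setAccessible(true)
      walk(f.get(obj), depth + 1)
    }
  }

  walk(root, 0)
}
{code}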
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212884#comment-14212884 ] Apache Spark commented on SPARK-2321: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3197 Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical Fix For: 1.2.0 This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4239) support view in HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4239. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3131 [https://github.com/apache/spark/pull/3131] support view in HiveQL -- Key: SPARK-4239 URL: https://issues.apache.org/jira/browse/SPARK-4239 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4245) Fix containsNull of the result ArrayType of CreateArray expression.
[ https://issues.apache.org/jira/browse/SPARK-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4245. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3110 [https://github.com/apache/spark/pull/3110] Fix containsNull of the result ArrayType of CreateArray expression. --- Key: SPARK-4245 URL: https://issues.apache.org/jira/browse/SPARK-4245 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Fix For: 1.2.0 The {{containsNull}} of the result {{ArrayType}} of {{CreateArray}} should be true only if the children list is empty or at least one child is nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4333) Correctly log number of iterations in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4333. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3180 [https://github.com/apache/spark/pull/3180] Correctly log number of iterations in RuleExecutor -- Key: SPARK-4333 URL: https://issues.apache.org/jira/browse/SPARK-4333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor Fix For: 1.2.0 RuleExecutor logs the wrong iteration count: it should report $(iteration - 1), not $(iteration). The log reads Fixed point reached for batch $(batch.name) after 3 iterations., but only 2 iterations actually ran! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
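The fix boils down to a one-line change in the fixed-point log message, roughly as follows (the logging call and variable names are illustrative, not the exact patch):
{code}
// Before: by the time the fixed point is detected, `iteration` has already been
// incremented past the last pass that actually executed.
logTrace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")

// After: report the number of iterations that actually ran.
logTrace(s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
{code}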
[jira] [Resolved] (SPARK-4062) Improve KafkaReceiver to prevent data loss
[ https://issues.apache.org/jira/browse/SPARK-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4062. -- Resolution: Fixed Improve KafkaReceiver to prevent data loss -- Key: SPARK-4062 URL: https://issues.apache.org/jira/browse/SPARK-4062 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Saisai Shao Assignee: Saisai Shao Attachments: RefactoredKafkaReceiver.pdf The current KafkaReceiver has data loss and data re-consuming problems. Here we propose a ReliableKafkaReceiver to improve its reliability and fault tolerance using Spark Streaming's WAL mechanism. This is follow-up work to SPARK-3129. The design doc is posted; any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3129) Prevent data loss in Spark Streaming on driver failure
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3129. -- Resolution: Fixed Fix Version/s: 1.2.0 I am marking this as fixed, as all non-test related issues have been merged. The one sub-task left is related to unit tests that use the WAL to do end-to-end tests and verify no data loss. Prevent data loss in Spark Streaming on driver failure -- Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.0.3 Reporter: Hari Shreedharan Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 Attachments: SecurityFix.diff Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). This currently affects all receivers. The solution we propose is to reliably store all the received data into HDFS. This will allow the data to persist through driver failures, and therefore it can be processed when the driver gets restarted. The high level design doc for this feature is given here. https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?usp=sharing This major task has been divided into sub-tasks - Implementing a write ahead log management system that can manage rolling write ahead logs - write to log, recover on failure and clean up old logs - Implementing an HDFS-backed block RDD that can read data either from Spark's BlockManager or from HDFS files - Implementing a ReceivedBlockHandler interface that abstracts out the functionality of saving received blocks - Implementing a ReceivedBlockTracker and other associated changes in the driver that allow metadata of received blocks and block-to-batch allocations to be recovered upon driver restart -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4246) Add testsuite with end-to-end testing of driver failure
[ https://issues.apache.org/jira/browse/SPARK-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4246: - Target Version/s: 1.2.0 Add testsuite with end-to-end testing of driver failure Key: SPARK-4246 URL: https://issues.apache.org/jira/browse/SPARK-4246 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3129: - Summary: Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs (was: Prevent data loss in Spark Streaming on driver failure) Prevent data loss in Spark Streaming on driver failure using Write Ahead Logs - Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.0.3 Reporter: Hari Shreedharan Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 Attachments: SecurityFix.diff Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). This currently affects all receivers. The solution we propose is to reliably store all the received data into HDFS. This will allow the data to persist through driver failures, and therefore it can be processed when the driver gets restarted. The high level design doc for this feature is given here. https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?usp=sharing This major task has been divided into sub-tasks - Implementing a write ahead log management system that can manage rolling write ahead logs - write to log, recover on failure and clean up old logs - Implementing an HDFS-backed block RDD that can read data either from Spark's BlockManager or from HDFS files - Implementing a ReceivedBlockHandler interface that abstracts out the functionality of saving received blocks - Implementing a ReceivedBlockTracker and other associated changes in the driver that allow metadata of received blocks and block-to-batch allocations to be recovered upon driver restart -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4391) Parquet Filter pushdown flag should be set with SQLConf
[ https://issues.apache.org/jira/browse/SPARK-4391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4391. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3258 [https://github.com/apache/spark/pull/3258] Parquet Filter pushdown flag should be set with SQLConf --- Key: SPARK-4391 URL: https://issues.apache.org/jira/browse/SPARK-4391 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4322) Struct fields can't be used as sub-expression of grouping fields
[ https://issues.apache.org/jira/browse/SPARK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4322. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3248 [https://github.com/apache/spark/pull/3248] Struct fields can't be used as sub-expression of grouping fields Key: SPARK-4322 URL: https://issues.apache.org/jira/browse/SPARK-4322 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 Some examples: {code} sqlContext.jsonRDD(sc.parallelize({a: {b: [{c: 1}]}} :: Nil)).registerTempTable(data1) sqlContext.sql(SELECT a.b[0].c FROM data1 GROUP BY a.b[0].c).collect() sqlContext.jsonRDD(sc.parallelize({a: {b: 1}} :: Nil)).registerTempTable(data2) sqlContext.sql(SELECT a.b + 1 FROM data2 GROUP BY a.b + 1).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
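The string quoting in the {code} examples above was stripped in formatting; the snippets presumably read roughly as follows (a reconstruction, not copied from the original report):
{code}
sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": [{"c": 1}]}}""" :: Nil)).registerTempTable("data1")
sqlContext.sql("SELECT a.b[0].c FROM data1 GROUP BY a.b[0].c").collect()

sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": 1}}""" :: Nil)).registerTempTable("data2")
sqlContext.sql("SELECT a.b + 1 FROM data2 GROUP BY a.b + 1").collect()
{code}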
[jira] [Resolved] (SPARK-4386) Parquet file write performance improvement
[ https://issues.apache.org/jira/browse/SPARK-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4386. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3254 [https://github.com/apache/spark/pull/3254] Parquet file write performance improvement -- Key: SPARK-4386 URL: https://issues.apache.org/jira/browse/SPARK-4386 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Jim Carroll Fix For: 1.2.0 If you profile the writing of a Parquet file, the single worst time consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized (optimized?). This doesn't need to be done. size is called repeatedly where needed rather than called once at the top of the method and stored in a 'val'. I have a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
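The improvement described amounts to hoisting the repeated size call out of the write loop; a sketch of the pattern follows (writeValue, attributes and record are illustrative names, not the actual MutableRowWriteSupport code):
{code}
// Before (simplified): the loop bound re-evaluates attributes.size on every
// iteration, and for a LinearSeqOptimized each call counts the elements again.
//   while (index < attributes.size) { writeValue(...); index += 1 }

// After: evaluate the size once and reuse it.
val attributesSize = attributes.size
var index = 0
while (index < attributesSize) {
  writeValue(attributes(index).dataType, record(index))
  index += 1
}
{code}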
[jira] [Resolved] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
[ https://issues.apache.org/jira/browse/SPARK-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4365. - Resolution: Fixed Issue resolved by pull request 3229 [https://github.com/apache/spark/pull/3229] Remove unnecessary filter call on records returned from parquet library --- Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those : from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java public boolean nextKeyValue() throws IOException, InterruptedException { boolean recordFound = false; while (!recordFound) { // no more records left if (current = total) { return false; } try { checkRead(); currentValue = recordReader.read(); current ++; if (recordReader.shouldSkipCurrentRecord()) { // this record is being filtered via the filter2 package if (DEBUG) LOG.debug(skipping record); continue; } if (currentValue == null) { // only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar; if (DEBUG) LOG.debug(filtered record reader reached end of block); continue; } recordFound = true; if (DEBUG) LOG.debug(read value: + currentValue); } catch (RuntimeException e) { throw new ParquetDecodingException(format(Can not read value at %d in block %d in file %s, current, currentBlock, file), e); } } return true; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4412) Parquet logger cannot be configured
[ https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4412. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3271 [https://github.com/apache/spark/pull/3271] Parquet logger cannot be configured --- Key: SPARK-4412 URL: https://issues.apache.org/jira/browse/SPARK-4412 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jim Carroll Fix For: 1.2.0 The Spark ParquetRelation.scala code assumes that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if reading a Parquet file is the first thing your application does with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log, there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer. The fix is to force the static initializer in parquet.Log to run as part of enableLogForwarding. PR will be forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
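A sketch of the kind of fix described (forcing the class to initialize before the logger is reconfigured); this is an illustration, not the merged patch:
{code}
def enableLogForwarding() {
  // Class.forName initializes parquet.Log now, so its static initializer cannot
  // later undo the logger configuration applied below.
  Class.forName("parquet.Log")

  // ...existing log-forwarding setup (adjust the "parquet" JUL logger's handlers/levels)...
}
{code}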
[jira] [Commented] (SPARK-4415) Driver did not exit after python driver had exited.
[ https://issues.apache.org/jira/browse/SPARK-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213076#comment-14213076 ] Apache Spark commented on SPARK-4415: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3274 Driver did not exit after python driver had exited. --- Key: SPARK-4415 URL: https://issues.apache.org/jira/browse/SPARK-4415 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Critical If we have spark_driver_memory in spark-default.cfg and start the Spark job by {code} $ python xxx.py {code} then the Spark driver will not exit after the Python process has exited. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need
[ https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4214. Resolution: Fixed Fix Version/s: 1.2.0 With dynamic allocation, avoid outstanding requests for more executors than pending tasks need -- Key: SPARK-4214 URL: https://issues.apache.org/jira/browse/SPARK-4214 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 Dynamic allocation tries to allocate more executors while we have pending tasks remaining. Our current policy can end up with more outstanding executor requests than needed to fulfill all the pending tasks. Capping the executor requests to the number of cores needed to fulfill all pending tasks would make dynamic allocation behavior less sensitive to settings for maxExecutors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
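A rough sketch of the capping policy described above (the names and the exact formula are illustrative; the real heuristic in ExecutorAllocationManager may differ):
{code}
// Upper bound on executors that the currently pending + running tasks could use.
def maxExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int =
  ((pendingTasks + runningTasks) + tasksPerExecutor - 1) / tasksPerExecutor  // ceiling

// When ramping up, never let (existing executors + outstanding requests) exceed
// min(maxExecutorsNeeded(...), spark.dynamicAllocation.maxExecutors).
{code}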
[jira] [Issue Comment Deleted] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Liu updated SPARK-4345: Comment: was deleted (was: Swallow NoSuchObjectException exception when drop a none-exist hive table. pull @https://github.com/apache/spark/pull/3211) Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at 
org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at
[jira] [Commented] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213121#comment-14213121 ] Alex Liu commented on SPARK-4345: - It looks like a bug in Hive https://issues.apache.org/jira/browse/HIVE-8564 Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at 
org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at
[jira] [Resolved] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table
[ https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Liu resolved SPARK-4345. - Resolution: Won't Fix Spark SQL Hive throws exception when drop a none-exist table Key: SPARK-4345 URL: https://issues.apache.org/jira/browse/SPARK-4345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When drop a none-exist hive table, an exception is thrown. log {code} scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details t: org.apache.spark.sql.SchemaRDD = SchemaRDD[13] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == DropTable test_table, true scala val t = hql(drop table if exists test_table); warning: there were 1 deprecation warning(s); re-run with -deprecation for details 14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) at com.sun.proxy.$Proxy14.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy15.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106) at 
org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:78) at
[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213128#comment-14213128 ] Apache Spark commented on SPARK-4349: - User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/3275 Spark driver hangs on sc.parallelize() if exception is thrown during serialization -- Key: SPARK-4349 URL: https://issues.apache.org/jira/browse/SPARK-4349 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Matt Cheah Fix For: 1.3.0 Executing the following in the Spark Shell will lead to the Spark Shell hanging after a stack trace is printed. The serializer is set to the Kryo serializer. {code} scala import com.esotericsoftware.kryo.io.Input import com.esotericsoftware.kryo.io.Input scala import com.esotericsoftware.kryo.io.Output import com.esotericsoftware.kryo.io.Output scala class MyKryoSerializable extends com.esotericsoftware.kryo.KryoSerializable { def write (kryo: com.esotericsoftware.kryo.Kryo, output: Output) { throw new com.esotericsoftware.kryo.KryoException; } ; def read (kryo: com.esotericsoftware.kryo.Kryo, input: Input) { throw new com.esotericsoftware.kryo.KryoException; } } defined class MyKryoSerializable scala sc.parallelize(Seq(new MyKryoSerializable, new MyKryoSerializable)).collect {code} A stack trace is printed during serialization as expected, but another stack trace is printed afterwards, indicating that the driver can't recover: {code} 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not unique! akka.actor.PostRestartException: exception post restart (class java.io.IOException) at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249) at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247) at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302) at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247) at akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76) at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] is not unique! 
at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) at akka.actor.ActorCell.attachChild(ActorCell.scala:369) at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552) at org.apache.spark.executor.Executor.init(Executor.scala:97) at org.apache.spark.scheduler.local.LocalActor.init(LocalBackend.scala:53) at org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) at org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343) at akka.actor.Props.newActor(Props.scala:252) at akka.actor.ActorCell.newActor(ActorCell.scala:552) at akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234) ... 11 more {code} -- This message was sent by Atlassian JIRA
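As a side note, one way to avoid the hang described above is to surface serialization failures on the driver before sc.parallelize() ever sees the data. The sketch below is not from this ticket or from PR 3275; it is a hedged workaround that eagerly exercises Kryo's write() with a standalone Kryo instance (not Spark's configured serializer), so a MyKryoSerializable that throws fails fast with an ordinary exception instead of wedging the shell:

{code}
// Hedged workaround sketch (not part of the actual fix): serialize each element
// once on the driver with a plain Kryo instance so that a failing write() raises
// a normal exception before sc.parallelize() is involved.
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

def assertKryoSerializable(items: Seq[AnyRef]): Unit = {
  val kryo = new Kryo()
  items.foreach { item =>
    val out = new Output(new ByteArrayOutputStream())
    try kryo.writeClassAndObject(out, item)   // throws KryoException if write() fails
    finally out.close()
  }
}

val data = Seq(new MyKryoSerializable, new MyKryoSerializable)
assertKryoSerializable(data)      // fails here, on the driver, with a clear stack trace
sc.parallelize(data).collect()    // only reached if serialization succeeded
{code}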
[jira] [Created] (SPARK-4417) New API: sample RDD to fixed number of items
Davies Liu created SPARK-4417: - Summary: New API: sample RDD to fixed number of items Key: SPARK-4417 URL: https://issues.apache.org/jira/browse/SPARK-4417 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Reporter: Davies Liu Sometimes we just want a fixed number of items randomly selected from an RDD; for example, before sorting an RDD we need to gather a fixed number of keys from each partition. To do this today we need two passes over the RDD: first get the total count, then calculate the right ratio for sampling. In fact, we could do this in one pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
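For context, the two-pass pattern the ticket wants to collapse into a single pass looks roughly like the sketch below. The helper name sampleToSize, the default seed, and the 20% oversampling factor are made up for illustration; only count(), sample() and take() are existing RDD methods:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Today's approach: one full pass to count, a second pass to sample.
// The proposed API would reach a fixed sample size in a single pass.
def sampleToSize[T: ClassTag](rdd: RDD[T], numItems: Int, seed: Long = 42L): Array[T] = {
  val total = rdd.count()                                  // pass 1: scan just to size the sample
  if (total == 0L) return Array.empty[T]
  // Oversample slightly so the probabilistic sample rarely comes up short.
  val fraction = math.min(1.0, 1.2 * numItems.toDouble / total)
  rdd.sample(withReplacement = false, fraction, seed)      // pass 2: the actual sampling
    .take(numItems)
}

// Example: sampleToSize(sc.parallelize(1 to 1000000), 100)
{code}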
[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support
[ https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213300#comment-14213300 ] Guoqiang Li commented on SPARK-3445: I think there are a lot of people still using Hadoop 1.x. We should support it at least until 2015. Deprecate and later remove YARN alpha support - Key: SPARK-3445 URL: https://issues.apache.org/jira/browse/SPARK-3445 Project: Spark Issue Type: Improvement Components: YARN Reporter: Patrick Wendell This will depend a bit on both user demand and the commitment level of maintainers, but I'd like to propose the following timeline for yarn-alpha support. Spark 1.2: Deprecate YARN-alpha Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable) Since YARN-alpha is clearly identified as an alpha API, it seems reasonable to drop support for it in a minor release. However, it does depend a bit on whether anyone uses this outside of Yahoo!, and I'm not sure of that. In the past this API has been used and maintained by Yahoo, but they'll be migrating soon to the stable APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4404) SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
[ https://issues.apache.org/jira/browse/SPARK-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4404. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: WangTaoTheTonic Target Version/s: 1.2.0 SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends - Key: SPARK-4404 URL: https://issues.apache.org/jira/browse/SPARK-4404 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.2.0 When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver. If we get the process id of SparkSubmitDriverBootstrapper and kill it while it is running, we expect its SparkSubmit sub-process to stop as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4415) Driver did not exit after python driver had exited.
[ https://issues.apache.org/jira/browse/SPARK-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4415. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Davies Liu Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Driver did not exit after python driver had exited. --- Key: SPARK-4415 URL: https://issues.apache.org/jira/browse/SPARK-4415 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Fix For: 1.2.0 If we have spark.driver.memory in spark-defaults.conf and start the Spark job with {code} $ python xxx.py {code} then the Spark driver does not exit after the Python process has exited. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4420) Change nullability of Cast from DoubleType/FloatType to DecimalType.
Takuya Ueshin created SPARK-4420: Summary: Change nullability of Cast from DoubleType/FloatType to DecimalType. Key: SPARK-4420 URL: https://issues.apache.org/jira/browse/SPARK-4420 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up to SPARK-4390. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
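For readers wondering why the nullability has to change: with the fixed-precision decimals introduced by SPARK-4390, a Double or Float value that does not fit the target precision is cast to null, so the Cast expression must report nullable = true. The helper below is a hypothetical, plain-Scala illustration of that behavior, not Spark's actual Cast implementation:

{code}
// Hypothetical illustration of the overflow-to-null behavior behind this change.
// A DECIMAL(10, 2) holds at most 8 integral digits, so 1e30 cannot be represented.
def castToDecimal(d: Double, precision: Int, scale: Int): Option[BigDecimal] = {
  val bd = BigDecimal(d).setScale(scale, BigDecimal.RoundingMode.HALF_UP)
  if (bd.precision > precision) None else Some(bd)   // overflow becomes null in Spark SQL
}

castToDecimal(12.5, 10, 2)   // Some(12.50)
castToDecimal(1e30, 10, 2)   // None -- the cast result is null, hence nullable = true
{code}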