[jira] [Resolved] (SPARK-1779) Warning when spark.storage.memoryFraction is not between 0 and 1
[ https://issues.apache.org/jira/browse/SPARK-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1779. Resolution: Fixed Fixed via: https://github.com/apache/spark/pull/714 Warning when spark.storage.memoryFraction is not between 0 and 1 Key: SPARK-1779 URL: https://issues.apache.org/jira/browse/SPARK-1779 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: wangfei Fix For: 1.1.0 There should be a warning when memoryFraction is lower than 0 or greater than 1 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
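The requested check can be sketched in a few lines. The following is a minimal Python illustration of the idea only, not the change made in pull request 714; the function name `check_memory_fraction` and the wording of the message are assumptions made for this example.

```python
import logging

logger = logging.getLogger(__name__)


def check_memory_fraction(value):
    """Return a warning message if value lies outside [0, 1], else None.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    if 0.0 <= value <= 1.0:
        return None
    message = ("spark.storage.memoryFraction should be between 0 and 1 "
               "(was %r)" % (value,))
    # Warn rather than fail, as the issue asks for a warning only.
    logger.warning(message)
    return message
```

The bounds are treated as inclusive because the issue only asks for a warning when the value is lower than 0 or greater than 1.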
[jira] [Created] (SPARK-2859) Update url of Kryo project in tuning.md
Guancheng Chen created SPARK-2859: - Summary: Update url of Kryo project in tuning.md Key: SPARK-2859 URL: https://issues.apache.org/jira/browse/SPARK-2859 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Guancheng Chen Priority: Trivial Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
[jira] [Commented] (SPARK-2862) DoubleRDDFunctions.histogram() throws exception for some inputs
[ https://issues.apache.org/jira/browse/SPARK-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086182#comment-14086182 ] Apache Spark commented on SPARK-2862: - User 'nrchandan' has created a pull request for this issue: https://github.com/apache/spark/pull/1787 DoubleRDDFunctions.histogram() throws exception for some inputs --- Key: SPARK-2862 URL: https://issues.apache.org/jira/browse/SPARK-2862 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 0.9.1, 1.0.0 Environment: Scala version 2.9.2 (OpenJDK 64-Bit Server VM, Java 1.7.0_55) running on Ubuntu 14.04 Reporter: Chandan Kumar
histogram method call throws the below stack trace when the choice of bucketCount partitions the RDD in irrational increments e.g.
scala> val r = sc.parallelize(6 to 99)
r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> r.histogram(9)
java.lang.IndexOutOfBoundsException: 9
at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124)
at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
at org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
[jira] [Updated] (SPARK-2862) DoubleRDDFunctions.histogram() throws exception for some inputs
[ https://issues.apache.org/jira/browse/SPARK-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-2862: --- Affects Version/s: 1.0.1 DoubleRDDFunctions.histogram() throws exception for some inputs --- Key: SPARK-2862 URL: https://issues.apache.org/jira/browse/SPARK-2862 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1 Environment: Scala version 2.9.2 (OpenJDK 64-Bit Server VM, Java 1.7.0_55) running on Ubuntu 14.04 Reporter: Chandan Kumar
histogram method call throws the below stack trace when the choice of bucketCount partitions the RDD in irrational increments e.g.
scala> val r = sc.parallelize(6 to 99)
r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> r.histogram(9)
java.lang.IndexOutOfBoundsException: 9
at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124)
at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
at org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
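The report ties the failure to bucket widths that are not exactly representable (here 93 / 9). One way to sidestep that class of problem is to derive each boundary from its integer index rather than stepping by a floating-point increment. The Python sketch below illustrates that idea under stated assumptions; the function name `even_bucket_boundaries` is hypothetical, and this is not claimed to be what pull request 1787 does.

```python
def even_bucket_boundaries(lo, hi, bucket_count):
    """Return bucket_count + 1 evenly spaced boundaries from lo to hi.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    span = hi - lo
    # Compute each boundary from its integer index instead of accumulating
    # a floating-point step, so rounding cannot change how many there are.
    boundaries = [lo + (i * span) / bucket_count for i in range(bucket_count)]
    # Append the upper bound exactly rather than recomputing it.
    boundaries.append(hi)
    return boundaries
```

For the input from the report, `even_bucket_boundaries(6, 99, 9)` always has ten entries, starting at 6 and ending at 99, whatever the rounding of the intermediate values.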
[jira] [Updated] (SPARK-2861) Doc comment of DoubleRDDFunctions.histogram is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan Kumar updated SPARK-2861: - Description: The documentation comment of histogram method of DoubleRDDFunctions class in source file DoubleRDDFunctions.scala is inconsistent. This might confuse somebody reading the documentation. Comment in question:
{code}
/**
 * Compute a histogram using the provided buckets. The buckets are all open
 * to the left except for the last which is closed
 * e.g. for the array
 * [1, 10, 20, 50] the buckets are [1, 10) [10, 20) [20, 50]
 * e.g 1<=x<10 , 10<=x<20, 20<=x<50
 * And on the input of 1 and 50 we would have a histogram of 1, 0, 0
{code}
The buckets are all open to the right (NOT left) except for the last which is closed
For the example quoted, the last bucket should be 20<=x<=50. Also, the histogram result on input of 1 and 50 would be 1, 0, 1 (NOT 1, 0, 0). This works correctly in Spark but the doc comment is incorrect.
was: The documentation comment of histogram method of DoubleRDDFunctions class in source file DoubleRDDFunctions.scala is partially incorrect, hence inconsistent. This might confuse somebody reading the documentation.
Doc comment of DoubleRDDFunctions.histogram is incorrect Key: SPARK-2861 URL: https://issues.apache.org/jira/browse/SPARK-2861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Chandan Kumar Priority: Trivial
The documentation comment of histogram method of DoubleRDDFunctions class in source file DoubleRDDFunctions.scala is inconsistent. This might confuse somebody reading the documentation. Comment in question:
{code}
/**
 * Compute a histogram using the provided buckets. The buckets are all open
 * to the left except for the last which is closed
 * e.g. for the array
 * [1, 10, 20, 50] the buckets are [1, 10) [10, 20) [20, 50]
 * e.g 1<=x<10 , 10<=x<20, 20<=x<50
 * And on the input of 1 and 50 we would have a histogram of 1, 0, 0
{code}
The buckets are all open to the right (NOT left) except for the last which is closed
For the example quoted, the last bucket should be 20<=x<=50. Also, the histogram result on input of 1 and 50 would be 1, 0, 1 (NOT 1, 0, 0). This works correctly in Spark but the doc comment is incorrect.
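The bucket semantics described in this report (every bucket open on the right, the last one closed) can be checked with a small stand-alone model. The Python sketch below is an illustration of those semantics only; it is not Spark's implementation, and the function name `histogram` is reused here purely for readability.

```python
from bisect import bisect_right


def histogram(values, buckets):
    """Count values per bucket; buckets is a sorted list of boundaries.

    Buckets are [b0, b1), [b1, b2), ..., with the last one closed on the
    right. Illustrative model only; this is not Spark code.
    """
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # outside every bucket: not counted
        if v == buckets[-1]:
            index = len(counts) - 1  # the last bucket includes its upper bound
        else:
            index = bisect_right(buckets, v) - 1
        counts[index] += 1
    return counts


print(histogram([1, 50], [1, 10, 20, 50]))  # [1, 0, 1]
```

With the buckets from the doc comment, the inputs 1 and 50 land in the first and last buckets, which matches the 1, 0, 1 result the reporter expects.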
[jira] [Created] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive UDFs
William Benton created SPARK-2863: - Summary: Emulate Hive type coercion in native reimplementations of Hive UDFs Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As Michael Armbrust points out (https://github.com/apache/spark/pull/1750#discussion_r15790970), queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions.
[jira] [Commented] (SPARK-2636) no where to get job identifier while submit spark job through spark API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086494#comment-14086494 ] Marcelo Vanzin commented on SPARK-2636: --- (BTW, just checked SPARK-2321, so if you really mean the {{Job}} id, ignore my comments, since yes, it's kind of a pain to know the ID of a job you're submitting to the context.) no where to get job identifier while submit spark job through spark API --- Key: SPARK-2636 URL: https://issues.apache.org/jira/browse/SPARK-2636 Project: Spark Issue Type: New Feature Reporter: Chengxiang Li
In Hive on Spark, we want to track Spark job status through the Spark API. The basic idea is as follows:
# Create a Hive-specific Spark listener and register it with the Spark listener bus.
# The Hive-specific Spark listener generates job status from Spark listener events.
# The Hive driver tracks job status through the Hive-specific Spark listener.
The current problem is that the Hive driver needs a job identifier to track the status of a specific job through the Spark listener, but there is no Spark API to get a job identifier (such as a job id) when submitting a Spark job. I think any other project that tries to track job status with the Spark API would suffer from this as well.
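The three steps in the description can be modelled without Spark to show where the job identifier is needed. The Python sketch below is a toy model under stated assumptions: the class `JobStatusListener` and its method names are hypothetical and do not correspond to Spark's listener API.

```python
class JobStatusListener:
    """Hypothetical listener that records the latest status per job id."""

    def __init__(self):
        self._status = {}

    def on_job_start(self, job_id):
        # Called by the (imagined) listener bus when a job starts.
        self._status[job_id] = "RUNNING"

    def on_job_end(self, job_id, succeeded):
        # Called by the (imagined) listener bus when a job finishes.
        self._status[job_id] = "SUCCEEDED" if succeeded else "FAILED"

    def status_of(self, job_id):
        # The driver can only ask about a job whose id it knows; without an
        # id returned at submission time it has no key to look up here.
        return self._status.get(job_id, "UNKNOWN")
```

The listener sees every event keyed by job id, but the caller that submitted the job has nothing to pass to `status_of` unless submission hands the id back, which is the gap the issue describes.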
[jira] [Created] (SPARK-2864) fix random seed in Word2Vec
Xiangrui Meng created SPARK-2864: Summary: fix random seed in Word2Vec Key: SPARK-2864 URL: https://issues.apache.org/jira/browse/SPARK-2864 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The random seed is not fixed in word2vec, making the unit tests fail randomly.
[jira] [Commented] (SPARK-2622) Add Jenkins build numbers to SparkQA messages
[ https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086513#comment-14086513 ] Xiangrui Meng commented on SPARK-2622: -- The build number is included in the SparkQA message, for example: https://github.com/apache/spark/pull/1788 The build number 17941 is in the URL https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17941/consoleFull. Just need to be careful to match the number. Add Jenkins build numbers to SparkQA messages - Key: SPARK-2622 URL: https://issues.apache.org/jira/browse/SPARK-2622 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.1 Reporter: Xiangrui Meng Priority: Minor It takes Jenkins 2 hours to finish testing. It is possible to have the following:
{code}
Build 1 started.
PR updated.
Build 2 started.
Build 1 finished successfully.
A committer merged the PR because the last build seemed to be okay.
Build 2 failed.
{code}
It would be nice to put the build number in the SparkQA message so it is easy to match the result with the build.
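Matching a result to its build, as the comment suggests, comes down to reading the number out of the console URL. The Python sketch below shows one way to do that; the helper name `build_number` is hypothetical, and the pattern assumes URLs shaped like the one quoted in the comment.

```python
import re

# Assumes the URL layout shown in the comment:
# .../job/SparkPullRequestBuilder/<number>/consoleFull
_BUILD_URL = re.compile(r"/job/SparkPullRequestBuilder/(\d+)/")


def build_number(console_url):
    """Return the Jenkins build number embedded in a console URL, or None."""
    match = _BUILD_URL.search(console_url)
    return int(match.group(1)) if match else None
```

Comparing the number from the "started" message with the number from the "finished" message is then enough to tell whether a passing result belongs to the latest build of the pull request.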
[jira] [Closed] (SPARK-2622) Add Jenkins build numbers to SparkQA messages
[ https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2622. Resolution: Fixed Add Jenkins build numbers to SparkQA messages - Key: SPARK-2622 URL: https://issues.apache.org/jira/browse/SPARK-2622 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.1 Reporter: Xiangrui Meng Priority: Minor It takes Jenkins 2 hours to finish testing. It is possible to have the following: {code} Build 1 started. PR updated. Build 2 started. Build 1 finished successfully. A committer merged the PR because the last build seemed to be okay. Build 2 failed. {code} It would be nice to put the build number in the SparkQA message so it is easy to match the result with the build.
[jira] [Resolved] (SPARK-1890) add modify acls to the web UI for the kill button
[ https://issues.apache.org/jira/browse/SPARK-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1890. -- Resolution: Fixed add modify acls to the web UI for the kill button --- Key: SPARK-1890 URL: https://issues.apache.org/jira/browse/SPARK-1890 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Fix For: 1.1.0 A kill button has been added to the UI to allow you to kill tasks. Currently this is either enabled or disabled. I think we should add another set of acls to control who has permission to use this. We currently have view acls in the Security Manager, which take effect if you have a servlet filter installed that does authentication. We should add another set of acls, modify acls, that controls who has permission to use the kill button.
[jira] [Commented] (SPARK-2854) Finalize _acceptable_types in pyspark.sql
[ https://issues.apache.org/jira/browse/SPARK-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086560#comment-14086560 ] Yin Huai commented on SPARK-2854: - Since we already do conversions for ByteType and ShortType when we have int values, to be consistent, we can also support long values for ByteType, ShortType and IntegerType. For datetime.time and datetime.date values, because there are SQL Time and Date types, datetime.time and datetime.date will not be allowed as value types for TimestampType. So, here is the updated _acceptable_types:
{code}
_acceptable_types = {
    BooleanType: (bool,),
    ByteType: (int, long),
    ShortType: (int, long),
    IntegerType: (int, long),
    LongType: (int, long),
    FloatType: (float,),
    DoubleType: (float,),
    DecimalType: (decimal.Decimal,),
    StringType: (str, unicode),
    TimestampType: (datetime.datetime,),
    ArrayType: (list, tuple, array),
    MapType: (dict,),
    StructType: (tuple, list),
}
{code}
Finalize _acceptable_types in pyspark.sql - Key: SPARK-2854 URL: https://issues.apache.org/jira/browse/SPARK-2854 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Priority: Blocker In PySpark, _acceptable_types defines accepted Python data types for every Spark SQL data type. The list is shown below.
{code}
_acceptable_types = {
    BooleanType: (bool,),
    ByteType: (int, long),
    ShortType: (int, long),
    IntegerType: (int, long),
    LongType: (int, long),
    FloatType: (float,),
    DoubleType: (float,),
    DecimalType: (decimal.Decimal,),
    StringType: (str, unicode),
    TimestampType: (datetime.datetime, datetime.time, datetime.date),
    ArrayType: (list, tuple, array),
    MapType: (dict,),
    StructType: (tuple, list),
}
{code}
Let's double check this mapping before 1.1 release.
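To see how such a mapping behaves when values are checked against it, here is a small stand-alone Python 3 sketch. It is not PySpark code: string names stand in for the Spark SQL type classes, `int` and `str` replace the Python 2 pairs `(int, long)` and `(str, unicode)`, and the names `ACCEPTABLE_TYPES` and `is_acceptable` are hypothetical.

```python
import array
import datetime
import decimal

# String keys stand in for the Spark SQL type classes (an assumption made
# so the example runs without PySpark installed).
ACCEPTABLE_TYPES = {
    "BooleanType": (bool,),
    "ByteType": (int,),
    "ShortType": (int,),
    "IntegerType": (int,),
    "LongType": (int,),
    "FloatType": (float,),
    "DoubleType": (float,),
    "DecimalType": (decimal.Decimal,),
    "StringType": (str,),
    "TimestampType": (datetime.datetime,),
    "ArrayType": (list, tuple, array.array),
    "MapType": (dict,),
    "StructType": (tuple, list),
}


def is_acceptable(type_name, value):
    """Return True if value's exact Python type is listed for type_name."""
    # Exact type match on purpose: isinstance would let bool pass as int.
    return type(value) in ACCEPTABLE_TYPES[type_name]
```

Under the updated mapping, `is_acceptable("TimestampType", datetime.date(2014, 8, 5))` is false while a `datetime.datetime` value passes, which mirrors the restriction proposed in the comment. The exact-type comparison is a design choice of this sketch.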
[jira] [Created] (SPARK-2865) Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
Zongheng Yang created SPARK-2865: Summary: Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish Key: SPARK-2865 URL: https://issues.apache.org/jira/browse/SPARK-2865 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.0.1, 1.1.0 Environment: 16-node EC2 r3.2xlarge cluster Reporter: Zongheng Yang Priority: Blocker In the application I tested, most of the tasks out of 128 tasks could finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever with the following stack trace. There were no apparent failures from the UI, also the nodes where the stuck tasks were running had no apparent memory/CPU/disk pressures. {noformat} Executor task launch worker-0 daemon prio=10 tid=0x7f32ec003800 nid=0xaac waiting on condition [0x7f33f4428000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x7f3e0d7198e8 (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832) at 
org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122) at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497) at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495) at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) at org.apache.spark.rdd.RDD.iterator(RDD.scala:227) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} This behavior does *not* appear on 1.0 (reusing the same cluster), but appears on the master branch as of Aug 4, 2014 *and* 1.0.1. 
Further, I tried out [this patch|https://github.com/apache/spark/pull/1758], and it didn't fix the behavior. Further, when this behavior happened, the driver printed out the following line repeatedly: {noformat} 14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms {noformat}
[jira] [Updated] (SPARK-2865) Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
[ https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zongheng Yang updated SPARK-2865: - Description: In the application I tested, most of the tasks out of 128 tasks could finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever ( 5 hrs with no progress at all) with the following stack trace. There were no apparent failures from the UI, also the nodes where the stuck tasks were running had no apparent memory/CPU/disk pressures. {noformat} Executor task launch worker-0 daemon prio=10 tid=0x7f32ec003800 nid=0xaac waiting on condition [0x7f33f4428000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x7f3e0d7198e8 (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832) at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122) at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497) at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495) at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) at org.apache.spark.rdd.RDD.iterator(RDD.scala:227) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} This behavior does *not* appear on 1.0 (reusing the same cluster), but appears on the master branch as of Aug 4, 2014 *and* 1.0.1. Further, I tried out [this patch|https://github.com/apache/spark/pull/1758], and it didn't fix the behavior. 
When this behavior happened, the driver printed out the following line repeatedly: {noformat} 14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms {noformat} was: In the application I tested, most of the tasks out of 128 tasks could finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever with the following stack trace. There were no apparent failures from the UI, also the nodes where the stuck tasks were running had no apparent memory/CPU/disk pressures. {noformat} Executor task launch worker-0 daemon prio=10 tid=0x7f32ec003800
[jira] [Updated] (SPARK-2865) Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
[ https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zongheng Yang updated SPARK-2865: - Description: In the application I tested, most of the tasks out of 128 tasks could finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever with the following stack trace. There were no apparent failures from the UI, also the nodes where the stuck tasks were running had no apparent memory/CPU/disk pressures. {noformat} Executor task launch worker-0 daemon prio=10 tid=0x7f32ec003800 nid=0xaac waiting on condition [0x7f33f4428000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x7f3e0d7198e8 (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832) at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122) at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497) at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495) at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) at org.apache.spark.rdd.RDD.iterator(RDD.scala:227) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} This behavior does *not* appear on 1.0 (reusing the same cluster), but appears on the master branch as of Aug 4, 2014 *and* 1.0.1. Further, I tried out [this patch|https://github.com/apache/spark/pull/1758], and it didn't fix the behavior. 
When this behavior happened, the driver printed out the following line repeatedly: {noformat} 14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms {noformat} was: In the application I tested, most of the tasks out of 128 tasks could finish, but sometimes (pretty deterministically) either 1 or 3 tasks would just hang forever with the following stack trace. There were no apparent failures from the UI, also the nodes where the stuck tasks were running had no apparent memory/CPU/disk pressures. {noformat} Executor task launch worker-0 daemon prio=10 tid=0x7f32ec003800 nid=0xaac waiting on condition
[jira] [Resolved] (SPARK-2860) Resolving CASE WHEN throws None.get exception
[ https://issues.apache.org/jira/browse/SPARK-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2860. - Resolution: Fixed Fix Version/s: 1.1.0 Resolving CASE WHEN throws None.get exception - Key: SPARK-2860 URL: https://issues.apache.org/jira/browse/SPARK-2860 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0
[jira] [Created] (SPARK-2866) ORDER BY attributes must appear in SELECT clause
Michael Armbrust created SPARK-2866: --- Summary: ORDER BY attributes must appear in SELECT clause Key: SPARK-2866 URL: https://issues.apache.org/jira/browse/SPARK-2866 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical
[jira] [Resolved] (SPARK-2859) Update url of Kryo project in related docs
[ https://issues.apache.org/jira/browse/SPARK-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2859. Resolution: Fixed Fix Version/s: 1.1.0 1.0.3 Issue resolved by pull request 1782 [https://github.com/apache/spark/pull/1782] Update url of Kryo project in related docs -- Key: SPARK-2859 URL: https://issues.apache.org/jira/browse/SPARK-2859 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Guancheng Chen Priority: Trivial Fix For: 1.0.3, 1.1.0 The Kryo project has migrated from Google Code to GitHub, so we need to update its URL in related docs such as tuning.md.
[jira] [Updated] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2534: - Component/s: Spark Core Avoid pulling in the entire RDD or PairRDDFunctions in various operators Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0, 1.0.2
The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes:
{code}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false)
  bufs.mapValues(_.toIterable)
}
{code}
Changing the functions from def to val would solve it.
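The def-versus-val difference described in SPARK-2534 can be observed without Spark. The following is a minimal sketch, not the actual fix: Holder is a hypothetical stand-in for PairRDDFunctions and is deliberately left non-serializable so that any captured reference to it shows up as a serialization failure. Whether the def-based closure captures the enclosing instance depends on the Scala compiler version (the report concerns the compilers Spark used at the time), so the sketch prints the result instead of assuming it.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object ClosureCaptureSketch {
  // Hypothetical stand-in for PairRDDFunctions; deliberately NOT Serializable.
  class Holder {
    // Local def turned into a function with `_`: depending on the compiler,
    // the resulting closure may hold a reference to the enclosing Holder.
    def viaDef: Int => ArrayBuffer[Int] = {
      def createCombiner(v: Int) = ArrayBuffer(v)
      createCombiner _
    }

    // Function value that touches no members of Holder, so it has no
    // reason to hold a reference to the enclosing instance.
    def viaVal: Int => ArrayBuffer[Int] = {
      val createCombiner = (v: Int) => ArrayBuffer(v)
      createCombiner
    }
  }

  // Returns true if Java serialization of `obj` succeeds.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println("def-based closure serializable: " + canSerialize(h.viaDef))
    println("val-based closure serializable: " + canSerialize(h.viaVal))
  }
}
```

In Spark itself PairRDDFunctions is serializable, so a captured reference would not fail; it would only make each task larger, which is the symptom the issue reports.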
[jira] [Created] (SPARK-2867) saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class
Joseph Su created SPARK-2867: Summary: saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class Key: SPARK-2867 URL: https://issues.apache.org/jira/browse/SPARK-2867 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Joseph Su Priority: Minor saveAsHadoopFile() in PairRDDFunction.scala hard-codes the OutputCommitter class as FileOutputCommitter because of the following line in the source: hadoopConf.setOutputCommitter(classOf[FileOutputCommitter]) However, the OutputCommitter is a configurable option in a regular Hadoop MapReduce program: users can set mapred.output.committer.class to change the committer class used by other Hadoop programs. saveAsHadoopFile() should drop this hard-coded assignment and provide a way to specify the OutputCommitter used here.
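The behavior requested in SPARK-2867 amounts to "use the configured committer if there is one, otherwise fall back to the default". A minimal sketch of that lookup follows. It is not the Spark or Hadoop code: a plain Map stands in for the Hadoop job configuration, committerClassName is a made-up helper, and the fully qualified default class name is an assumption about which FileOutputCommitter the report refers to.

```scala
object CommitterChoiceSketch {
  // Property named in the report.
  val CommitterKey = "mapred.output.committer.class"
  // Assumed fully qualified name of the currently hard-coded default.
  val DefaultCommitter = "org.apache.hadoop.mapred.FileOutputCommitter"

  // A plain Map stands in for the Hadoop job configuration here.
  // The user's setting wins; the default applies only when nothing is set.
  def committerClassName(conf: Map[String, String]): String =
    conf.getOrElse(CommitterKey, DefaultCommitter)

  def main(args: Array[String]): Unit = {
    println(committerClassName(Map.empty))
    println(committerClassName(Map(CommitterKey -> "com.example.MyCommitter")))
  }
}
```

Here com.example.MyCommitter is a placeholder class name used only to show the override path.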
[jira] [Updated] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1977: - Fix Version/s: (was: 1.0.1) 1.0.2 mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor Fix For: 1.1.0, 1.0.2 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at
[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086649#comment-14086649 ] Sean Owen commented on SPARK-1834: -- Ah, you're right: https://github.com/apache/spark/commit/181ec5030792a10f3ce77e997d0e2eda9bcd6139 It was unlikely to be the problem anyway. Very strange. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. 
Here is a code snippet that exhibits the problem:
ArrayList<Integer> array = new ArrayList<Integer>();
for (int i = 0; i < 10; ++i) {
  array.add(i);
}
JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
  @Override
  public Tuple2<String, Integer> call(Integer t) throws Exception {
    return new Tuple2<String, Integer>("" + t, t);
  }
}).cache();
testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
  @Override
  public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception {
    return new Tuple2<String, Integer>(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
  }
});
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086657#comment-14086657 ] Zhan Zhang commented on SPARK-1537: --- I am also interested in this and am trying to integrate Spark with the YARN timeline server. Do you have a concrete plan in mind? I can start prototyping it, and then we can work together on this topic. What do you think? Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point.
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086670#comment-14086670 ] Marcelo Vanzin commented on SPARK-2848: --- Question for others ([~pwendell], [~sowen], maybe others): how important do you think it is to support this from the sbt side of the build? This is trivial to do on the maven side (just a few pom file changes). But I can't seem to find any sbt plugin that does class relocation like maven-shade-plugin. I could write the code, but that seems to go in the wrong direction of keeping the sbt build code small-ish. Shade Guava in Spark deliverables - Key: SPARK-2848 URL: https://issues.apache.org/jira/browse/SPARK-2848 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin As discussed in SPARK-2420, this task covers the work of shading Guava in Spark deliverables so that they don't conflict with the Hadoop classpath (nor user's classpath). Since one Guava class is exposed through Spark's API, that class will be forked from 14.0.1 (current version used by Spark) and excluded from any shading. The end result is that Spark's Guava won't be exposed to users anymore. This has the side-effect of effectively downgrading to version 11 (the one used by Hadoop) for those that do not explicitly depend on / package Guava with their apps.
[jira] [Updated] (SPARK-2699) Improve compatibility with parquet file/table
[ https://issues.apache.org/jira/browse/SPARK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2699: Target Version/s: 1.2.0 (was: 1.1.0) Improve compatibility with parquet file/table - Key: SPARK-2699 URL: https://issues.apache.org/jira/browse/SPARK-2699 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Teng Qiu After SPARK-2446, compatibility with Parquet files created by older Spark releases (Spark 1.0.x) and by Impala (all versions so far, up to 1.4.x-cdh5) is broken. Strings in those Parquet files are not annotated with UTF8, or contain only ASCII characters (Impala doesn't support UTF8 yet). This ticket aims to add a configuration option or a version check to support those Parquet files.
[jira] [Updated] (SPARK-2380) Support displaying accumulator contents in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2380: --- Fix Version/s: 1.1.0 Support displaying accumulator contents in the web UI - Key: SPARK-2380 URL: https://issues.apache.org/jira/browse/SPARK-2380 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Fix For: 1.1.0
[jira] [Resolved] (SPARK-2380) Support displaying accumulator contents in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2380. Resolution: Fixed Resolved by: https://github.com/apache/spark/pull/1309 Support displaying accumulator contents in the web UI - Key: SPARK-2380 URL: https://issues.apache.org/jira/browse/SPARK-2380 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical
[jira] [Created] (SPARK-2868) Support named accumulators in Python
Patrick Wendell created SPARK-2868: -- Summary: Support named accumulators in Python Key: SPARK-2868 URL: https://issues.apache.org/jira/browse/SPARK-2868 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Patrick Wendell SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to make some additional changes. One potential path is to have a 1:1 correspondence with Scala accumulators (instead of a one-to-many). A challenge is exposing the stringified values of the accumulators to the Scala code.
[jira] [Assigned] (SPARK-2583) ConnectionManager cannot distinguish whether error occurred or not
[ https://issues.apache.org/jira/browse/SPARK-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-2583: - Assignee: Josh Rosen (was: Kousuke Saruta) ConnectionManager cannot distinguish whether error occurred or not -- Key: SPARK-2583 URL: https://issues.apache.org/jira/browse/SPARK-2583 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kousuke Saruta Assignee: Josh Rosen Priority: Critical
ConnectionManager#handleMessage sends an empty message to the peer whether or not an error occurred in onReceiveCallback, so the peer cannot tell the two cases apart.
{code}
val ackMessage = if (onReceiveCallback != null) {
  logDebug("Calling back")
  onReceiveCallback(bufferMessage, connectionManagerId)
} else {
  logDebug("Not calling back as callback is null")
  None
}
if (ackMessage.isDefined) {
  if (!ackMessage.get.isInstanceOf[BufferMessage]) {
    logDebug("Response to " + bufferMessage + " is not a buffer message, it is of type " + ackMessage.get.getClass)
  } else if (!ackMessage.get.asInstanceOf[BufferMessage].hasAckId) {
    logDebug("Response to " + bufferMessage + " does not have ack id set")
    ackMessage.get.asInstanceOf[BufferMessage].ackId = bufferMessage.id
  }
}
// We have no way to tell peer whether error occurred or not
sendMessage(connectionManagerId, ackMessage.getOrElse {
  Message.createBufferMessage(bufferMessage.id)
})
}
{code}
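The ambiguity reported in SPARK-2583 comes from the getOrElse fallback: every path that yields no response ends in the same empty acknowledgement. A minimal sketch of that shape follows, using a hypothetical Message case class and ackFor helper in place of Spark's BufferMessage and handleMessage.

```scala
object AckAmbiguitySketch {
  // Hypothetical, simplified message: an id plus an optional payload.
  final case class Message(id: Int, payload: Option[String])

  // Same shape as the getOrElse in the reported snippet: when the callback
  // yields nothing, an empty message carrying the request id is sent.
  def ackFor(requestId: Int, callbackResult: Option[Message]): Message =
    callbackResult.getOrElse(Message(requestId, None))

  def main(args: Array[String]): Unit = {
    val callbackFailed: Option[Message] = None // the callback hit an error
    val nothingToSend: Option[Message] = None  // the callback had no reply
    // The receiver sees identical acknowledgements in both cases.
    println(ackFor(7, callbackFailed) == ackFor(7, nothingToSend))
  }
}
```

Telling the cases apart would need some extra signal in the acknowledgement, for example an error flag; the report does not say which approach the fix takes.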
[jira] [Commented] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086697#comment-14086697 ] Xiangrui Meng commented on SPARK-1856: -- Yes, MLI and MLbase are research projects at AMPLab. They are exploring the frontier of practical machine learning. Stable ideas/features from MLI and MLbase will be migrated into MLlib, and this is part of the effort. Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models.
[jira] [Resolved] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1680. -- Resolution: Fixed Clean up use of setExecutorEnvs in SparkConf - Key: SPARK-1680 URL: https://issues.apache.org/jira/browse/SPARK-1680 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Patrick Wendell Assignee: Thomas Graves Priority: Blocker Fix For: 1.1.0 We should make this consistent between YARN and Standalone. Basically, YARN mode should just use the executorEnvs from the Spark conf and not need SPARK_YARN_USER_ENV.
[jira] [Updated] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions
[ https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2863: Assignee: William Benton Emulate Hive type coercion in native reimplementations of Hive functions Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions.
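The {{SQRT(2)}} case in SPARK-2863 can be illustrated with a toy model: a native function that accepts only doubles rejects an integer literal unless the argument is widened first. The sketch below is only an illustration of that idea. Value, IntValue, DoubleValue, sqrtStrict and coerceToDouble are all hypothetical names, not Spark SQL's expression classes or its analyzer rules.

```scala
object CoercionSketch {
  // Hypothetical, simplified literal values.
  sealed trait Value
  final case class IntValue(v: Int) extends Value
  final case class DoubleValue(v: Double) extends Value

  // A native SQRT that only accepts doubles and applies no coercion.
  def sqrtStrict(arg: Value): Either[String, Double] = arg match {
    case DoubleValue(d) => Right(math.sqrt(d))
    case other          => Left("SQRT expects a double, got " + other)
  }

  // Widen integers to doubles; leave everything else untouched.
  def coerceToDouble(arg: Value): Value = arg match {
    case IntValue(i) => DoubleValue(i.toDouble)
    case other       => other
  }

  // The same native function, called after the coercion step.
  def sqrtCoerced(arg: Value): Either[String, Double] =
    sqrtStrict(coerceToDouble(arg))

  def main(args: Array[String]): Unit = {
    println(sqrtStrict(IntValue(2)))  // rejected: no coercion applied
    println(sqrtCoerced(IntValue(2))) // accepted after widening to double
  }
}
```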
[jira] [Updated] (SPARK-2844) Existing JVM Hive Context not correctly used in Python Hive Context
[ https://issues.apache.org/jira/browse/SPARK-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2844: Priority: Major (was: Minor) Target Version/s: 1.1.0 Existing JVM Hive Context not correctly used in Python Hive Context --- Key: SPARK-2844 URL: https://issues.apache.org/jira/browse/SPARK-2844 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Ahir Reddy Assignee: Ahir Reddy Unlike the SQLContext, passing an existing JVM HiveContext object into the Python HiveContext constructor does not actually re-use that object. Instead it will create a new HiveContext.
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086808#comment-14086808 ] Zhan Zhang commented on SPARK-1537: --- Do you mind sharing your thoughts, design document or prototype code? Thanks. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point.
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086817#comment-14086817 ] Marcelo Vanzin commented on SPARK-1537: --- Currently busy with other more urgent tasks, but I'll push to my repo and post a link when I get some time. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point.
[jira] [Commented] (SPARK-2854) Finalize _acceptable_types in pyspark.sql
[ https://issues.apache.org/jira/browse/SPARK-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086877#comment-14086877 ] Apache Spark commented on SPARK-2854: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/1793 Finalize _acceptable_types in pyspark.sql - Key: SPARK-2854 URL: https://issues.apache.org/jira/browse/SPARK-2854 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Priority: Blocker
In PySpark, _acceptable_types defines accepted Python data types for every Spark SQL data type. The list is shown below.
{code}
_acceptable_types = {
    BooleanType: (bool,),
    ByteType: (int, long),
    ShortType: (int, long),
    IntegerType: (int, long),
    LongType: (int, long),
    FloatType: (float,),
    DoubleType: (float,),
    DecimalType: (decimal.Decimal,),
    StringType: (str, unicode),
    TimestampType: (datetime.datetime, datetime.time, datetime.date),
    ArrayType: (list, tuple, array),
    MapType: (dict,),
    StructType: (tuple, list),
}
{code}
Let's double check this mapping before the 1.1 release.
[jira] [Updated] (SPARK-2650) Wrong initial sizes for in-memory column buffers
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2650: -- Target Version/s: 1.2.0 (was: 1.1.0) Wrong initial sizes for in-memory column buffers Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024.
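The size discrepancy mentioned in SPARK-2650 is easy to check with plain arithmetic. The sketch below is standalone Scala and does not use Spark; intendedDefault and reportedDefault are names made up for illustration, and the two expressions are the ones quoted in the report.

```scala
object BufferSizeSketch {
  // Expression the default appears to have been copied from: 10 MiB.
  val intendedDefault: Int = 10 * 1024 * 1024
  // Expression quoted in the report (note 102 instead of 1024).
  val reportedDefault: Int = 10 * 102 * 1024

  def main(args: Array[String]): Unit = {
    println(s"intended: $intendedDefault bytes")
    println(s"reported: $reportedDefault bytes")
  }
}
```

The reported expression evaluates to 1,044,480 bytes, roughly a tenth of the 10,485,760 bytes given by the expression it seems to have been copied from.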
[jira] [Commented] (SPARK-2866) ORDER BY attributes must appear in SELECT clause
[ https://issues.apache.org/jira/browse/SPARK-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087077#comment-14087077 ] Apache Spark commented on SPARK-2866: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1795 ORDER BY attributes must appear in SELECT clause Key: SPARK-2866 URL: https://issues.apache.org/jira/browse/SPARK-2866 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical
[jira] [Commented] (SPARK-2636) no where to get job identifier while submit spark job through spark API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087102#comment-14087102 ] Chengxiang Li commented on SPARK-2636: -- {quote} There are two ways I think. One is for DAGScheduler.runJob to return an integer (or long) id for the job. An alternative, which I think is better and relates to SPARK-2321, is for runJob to return some Job object that has information about the id and can be queried about progress. {quote} DAGScheduler is an internal Spark class that users can hardly use directly. I like your second idea: return a job info object when a Spark job is submitted at the SparkContext (JavaSparkContext in this case) or RDD level. AsyncRDDActions already does part of this work, so I think it may be a good place to fix this issue. no where to get job identifier while submit spark job through spark API --- Key: SPARK-2636 URL: https://issues.apache.org/jira/browse/SPARK-2636 Project: Spark Issue Type: New Feature Reporter: Chengxiang Li
In Hive on Spark, we want to track Spark job status through the Spark API. The basic idea is as follows:
# Create a Hive-specific Spark listener and register it on the Spark listener bus.
# The Hive-specific Spark listener generates job status from Spark listener events.
# The Hive driver tracks job status through the Hive-specific Spark listener.
The current problem is that the Hive driver needs a job identifier to track the status of a specific job through the Spark listener, but there is no Spark API for getting a job identifier (such as a job id) when submitting a Spark job. I think any other project that tries to track job status with the Spark API would run into this as well.
[jira] [Created] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
Zhihui created SPARK-2872: - Summary: Fix conflict between code and doc in YarnClientSchedulerBackend Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says that system properties override environment variables: https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 However, the code conflicts with it.
[jira] [Commented] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087120#comment-14087120 ] Zhihui commented on SPARK-2872: --- PR https://github.com/apache/spark/pull/1684 Fix conflict between code and doc in YarnClientSchedulerBackend --- Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says that system properties override environment variables: https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 However, the code conflicts with it.
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087179#comment-14087179 ] Marcelo Vanzin commented on SPARK-2848: --- Never mind the question; I got the code mostly working to do this on the sbt side. Shade Guava in Spark deliverables - Key: SPARK-2848 URL: https://issues.apache.org/jira/browse/SPARK-2848 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin As discussed in SPARK-2420, this task covers the work of shading Guava in Spark deliverables so that they don't conflict with the Hadoop classpath (or the user's classpath). Since one Guava class is exposed through Spark's API, that class will be forked from 14.0.1 (the current version used by Spark) and excluded from any shading. The end result is that Spark's Guava won't be exposed to users anymore. This has the side effect of effectively downgrading to version 11 (the one used by Hadoop) for those who do not explicitly depend on / package Guava with their apps.
[jira] [Created] (SPARK-2874) Spark SQL related scripts don't show complete usage message
Cheng Lian created SPARK-2874: - Summary: Spark SQL related scripts don't show complete usage message Key: SPARK-2874 URL: https://issues.apache.org/jira/browse/SPARK-2874 Project: Spark Issue Type: Bug Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor Due to [SPARK-2678|https://issues.apache.org/jira/browse/SPARK-2678], {{--help}} is shadowed by {{spark-submit}}, so {{bin/spark-sql}} and {{sbin/start-thriftserver2.sh}} can't show their application-specific usage messages.