[jira] [Commented] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232738#comment-14232738 ] Apache Spark commented on SPARK-4715: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3575

ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu

Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs: "0 did not equal -200 granted is negative"

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
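To make the arithmetic behind the -200 concrete, here is a minimal pure-Scala sketch of the bug class and the clamp-style fix. `ToyMemoryManager` and its fair-share accounting are invented for illustration; this is NOT Spark's real ShuffleMemoryManager.

```scala
import scala.collection.mutable

// Hypothetical manager: each registered thread is entitled to an equal share of
// maxMemory. If a thread already holds more than its share, the naive grant
// formula goes negative -- clamping it is the essence of the fix.
class ToyMemoryManager(maxMemory: Long) {
  private val threadMemory = mutable.Map.empty[String, Long]

  // The thread name is passed explicitly so the race is easy to replay sequentially.
  def tryToAcquire(thread: String, numBytes: Long): Long = synchronized {
    val curMem = threadMemory.getOrElse(thread, 0L)
    threadMemory(thread) = curMem // register this thread before computing shares
    val share = maxMemory / threadMemory.size
    // Naive grant: with 1000 total, 2 threads and 700 already held,
    // this is 1000/2 - 700 = -200, the exact value in the failing test above.
    val maxToGrant = math.min(numBytes, share - curMem)
    // The fix: never hand a negative grant back to the caller.
    val granted = math.max(0L, maxToGrant)
    threadMemory(thread) = curMem + granted
    granted
  }
}
```

With 1000 bytes total, once "main" holds 700 and "t1" is registered, the unclamped formula yields -200; the clamp returns 0 instead.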
[jira] [Created] (SPARK-4720) Remainder should also return null if the divider is 0.
Takuya Ueshin created SPARK-4720: Summary: Remainder should also return null if the divider is 0. Key: SPARK-4720 URL: https://issues.apache.org/jira/browse/SPARK-4720 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of SPARK-4593.
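The intended SQL semantics (mirroring SPARK-4593's fix for division) can be sketched with `Option` standing in for SQL NULL. This is an illustrative model, not Catalyst's actual Remainder expression.

```scala
// Sketch: x % 0 should yield NULL instead of throwing ArithmeticException,
// and NULL on either side propagates, matching SQL three-valued semantics.
object NullSafeRemainder {
  def remainder(dividend: Option[Int], divisor: Option[Int]): Option[Int] =
    (dividend, divisor) match {
      case (Some(a), Some(b)) if b != 0 => Some(a % b)
      case _                            => None // divisor is 0, or either side is NULL
    }
}
```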
[jira] [Commented] (SPARK-4397) Reorganize 'implicit's to improve the API convenience
[ https://issues.apache.org/jira/browse/SPARK-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232840#comment-14232840 ] Apache Spark commented on SPARK-4397: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3580 Reorganize 'implicit's to improve the API convenience - Key: SPARK-4397 URL: https://issues.apache.org/jira/browse/SPARK-4397 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 As I said here, http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAPn6-YTeUwGqvGady=vUjX=9bl_re7wb5-delbvfja842qm...@mail.gmail.com%3E many people asked how to convert an RDD to a PairRDDFunctions. If we can reorganize the `implicit`s properly, the API will be more convenient without importing SparkContext._ explicitly.
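One way such a reorganization can work, sketched with invented names rather than Spark's actual classes: Scala searches the companion object of the source type when resolving implicit conversions, so placing the conversion there removes the need for a wildcard import like `import SparkContext._`.

```scala
// Toy illustration of the implicit-scope trick. MyRDD and PairFunctions are
// made-up stand-ins for RDD and PairRDDFunctions.
class MyRDD[T](val data: Seq[T])

object MyRDD {
  // Found via the implicit scope of MyRDD: no import required at call sites.
  implicit def toPairFunctions[K, V](rdd: MyRDD[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(rdd)
}

class PairFunctions[K, V](rdd: MyRDD[(K, V)]) {
  def keys: Seq[K] = rdd.data.map(_._1)
}
```

A call site can now write `new MyRDD(Seq("a" -> 1)).keys` with no explicit import, because the compiler finds the conversion in `MyRDD`'s companion object.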
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232778#comment-14232778 ] Apache Spark commented on SPARK-4694: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3576 Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently, when I used YARN HA mode to test HiveThriftServer2, I found that the driver could not exit by itself. To reproduce it, do as follows: 1. use YARN HA mode and set am.maxAttemp = 1 for convenience; 2. kill the active resource manager in the cluster. The expected result is that the application simply fails, because maxAttemp was 1. The actual result is that all executors ended, but the driver was still there and never closed.
[jira] [Resolved] (SPARK-3391) Support attaching more than 1 EBS volumes
[ https://issues.apache.org/jira/browse/SPARK-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3391. Resolution: Fixed Fix Version/s: 1.2.0 This was merged. Support attaching more than 1 EBS volumes - Key: SPARK-3391 URL: https://issues.apache.org/jira/browse/SPARK-3391 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.2.0 Currently spark-ec2 supports attaching only one EBS volume. Attaching multiple EBS volumes can increase the aggregate throughput of EBS.
[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232802#comment-14232802 ] Micael Capitão commented on SPARK-3553: --- I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream. It is running on YARN and uses checkpointing. For now the application only reads the files and prints the number of read lines. Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that. 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files. Any thoughts on what can be causing this? Regards, Easyb
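For reference, the reporter's `fileStream` call reconstructed with the imports it relies on. `Settings.S3RequestsHost` is the reporter's own configuration object and `ssc` an existing StreamingContext, so this is a sketch, not a standalone program.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fileStream takes (directory, path filter, newFilesOnly). Passing
// newFilesOnly = true asks Spark to skip files already present when the
// stream starts -- the behavior this report shows misbehaving.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  Settings.S3RequestsHost,   // reporter's S3 directory
  (path: Path) => true,      // accept every file name
  true                       // newFilesOnly
)
```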
[jira] [Commented] (SPARK-4718) spark-ec2 script creates empty spark folder
[ https://issues.apache.org/jira/browse/SPARK-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232795#comment-14232795 ] Ignacio Blasco Lopez commented on SPARK-4718: - Tested with spark-1.1.1-bin-hadoop2.3 and spark-1.1.1-bin-hadoop2.4 spark-ec2 script creates empty spark folder --- Key: SPARK-4718 URL: https://issues.apache.org/jira/browse/SPARK-4718 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Reporter: Ignacio Blasco Lopez The spark-ec2 script in version 1.1.1 creates a spark folder containing only a conf directory.
[jira] [Commented] (SPARK-4714) Checking block is null or not after having gotten info.lock in remove block method
[ https://issues.apache.org/jira/browse/SPARK-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232717#comment-14232717 ] Apache Spark commented on SPARK-4714: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/3574 Checking block is null or not after having gotten info.lock in remove block method -- Key: SPARK-4714 URL: https://issues.apache.org/jira/browse/SPARK-4714 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.1.0 Reporter: SuYan Priority: Minor removeBlock()/dropOldBlock()/dropFromMemory() all share the same logic: 1. info = blockInfo.get(id); 2. if (info != null); 3. info.synchronized. There is a possibility that one thread acquires info.lock only after a previous thread, while holding info.lock, has already removed the entry from blockInfo. The current code does not re-check whether info is still valid after acquiring info.lock to remove the block, although as the code stands this does not cause any errors.
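A minimal sketch of the suggested re-check, using a plain ConcurrentHashMap rather than Spark's BlockManager internals (names are illustrative, not Spark's code):

```scala
import java.util.concurrent.ConcurrentHashMap

object BlockRemoval {
  val blockInfo = new ConcurrentHashMap[String, AnyRef]()

  def removeBlock(id: String): Boolean = {
    val info = blockInfo.get(id)
    if (info == null) return false
    info.synchronized {
      // Re-check under the lock: remove(key, value) only succeeds if the map
      // still holds this exact info object, so a racing thread that removed it
      // between our get() and the synchronized block is detected here.
      if (blockInfo.remove(id, info)) {
        // ... actually release the block's memory/disk data here ...
        true
      } else false
    }
  }
}
```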
[jira] [Issue Comment Deleted] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micael Capitão updated SPARK-3553: -- Comment: was deleted (was: I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that.) 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files. Any thoughts on what can be causing this? Regards, Easyb
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232734#comment-14232734 ] Ankur Dave commented on SPARK-4672: --- [~jerrylead] Thanks for investigating this bug and the excellent explanation. Now that the PRs are merged, can you confirm that the bug is fixed for you? I haven't yet been able to reproduce it locally. Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflowError will reliably occur in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., “degreeGraph => subGraph => degreeGraph => subGraph => …”) {code} # They will iterate many times to converge.
An example:
{code:borderStyle=solid}
// K-Core Algorithm
val kNum = 5
var degreeGraph = graph.outerJoinVertices(graph.degrees) {
  (vid, vd, degree) => degree.getOrElse(0)
}.cache()
do {
  val subGraph = degreeGraph.subgraph(
    vpred = (vid, degree) => degree >= kNum
  ).cache()
  val newDegreeGraph = subGraph.degrees
  degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()
  isConverged = check(degreeGraph)
} while (isConverged == false)
{code}
After about 300 iterations, the StackOverflowError will definitely occur, with the following stack trace:
{code:borderStyle=solid}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find out its causes, we detail them in the following 3 paragraphs.

h3. Phase 1: Try using checkpoint() to shorten the lineage

It is easy to suspect that the long lineage is the cause. For some RDDs, their lineages grow with the iterations. Also, for some magical references, their lineage lengths never decrease and finally become very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, and finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) gains 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the finalRDD to the ShuffledRDD) in each stage. To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops back down whenever it reaches a certain length (e.g., 33). However, the StackOverflowError still occurs after 300+ iterations!

h3. Phase 2: An abnormal f closure function leads to an unbreakable serialization chain

After long debugging, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage broken by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage stays as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> partitionsRDD: MapPartitionsRDD -> RDDs in the previous iterations {code} !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%! More description: while an RDD is being serialized, its f function {code:borderStyle=solid} e.g., f: (Iterator[A], Iterator[B]) => Iterator[V] in ZippedPartitionsRDD2 {code} will be serialized too. This action will be very dangerous if the f closure has a member
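The checkpoint-every-N workaround described in Phase 1 can be sketched as follows. This is a hedged sketch using the variable names from the K-Core example above; as the report explains, this alone did not cure the error, because the f-closure chain bypasses the truncated lineage.

```scala
// Sketch: truncate the lineage every `checkpointInterval` iterations.
// Assumes ssc.checkpointDir (or sc.setCheckpointDir) is already configured.
val checkpointInterval = 10
var iteration = 0
do {
  // ... one K-Core iteration producing a new degreeGraph, as in the example above ...
  iteration += 1
  if (iteration % checkpointInterval == 0) {
    degreeGraph.vertices.checkpoint()
    degreeGraph.edges.checkpoint()
    // Checkpointing is lazy: force an action so the data is actually written
    // and the dependency chain is cut here.
    degreeGraph.vertices.count()
    degreeGraph.edges.count()
  }
  isConverged = check(degreeGraph)
} while (!isConverged)
```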
[jira] [Updated] (SPARK-2456) Scheduler refactoring
[ https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2456: --- Assignee: (was: Reynold Xin) Scheduler refactoring - Key: SPARK-2456 URL: https://issues.apache.org/jira/browse/SPARK-2456 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin This is an umbrella ticket to track scheduler refactoring. We want to clearly define semantics and responsibilities of each component, and define explicit public interfaces for them so it is easier to understand and to contribute (also less buggy).
[jira] [Created] (SPARK-4718) spark-ec2 script creates empty spark folder
Ignacio Blasco Lopez created SPARK-4718: --- Summary: spark-ec2 script creates empty spark folder Key: SPARK-4718 URL: https://issues.apache.org/jira/browse/SPARK-4718 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Reporter: Ignacio Blasco Lopez The spark-ec2 script in version 1.1.1 creates a spark folder containing only a conf directory.
[jira] [Commented] (SPARK-4720) Remainder should also return null if the divider is 0.
[ https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232848#comment-14232848 ] Apache Spark commented on SPARK-4720: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/3581 Remainder should also return null if the divider is 0. -- Key: SPARK-4720 URL: https://issues.apache.org/jira/browse/SPARK-4720 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of SPARK-4593.
[jira] [Commented] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232804#comment-14232804 ] Apache Spark commented on SPARK-4719: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3578 Consolidate various narrow dep RDD classes with MapPartitionsRDD Key: SPARK-4719 URL: https://issues.apache.org/jira/browse/SPARK-4719 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD.
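The consolidation argument can be seen without any Spark classes: each specialized RDD corresponds to one per-partition `Iterator => Iterator` function, which is exactly the shape MapPartitionsRDD accepts. A pure-Scala sketch:

```scala
import scala.reflect.ClassTag

// Each "specialized RDD" reduces to one function applied per partition:
// MappedRDD, FilteredRDD and GlommedRDD are just these three shapes.
object NarrowOps {
  def mapFn[T, U](f: T => U): Iterator[T] => Iterator[U] = _.map(f)
  def filterFn[T](p: T => Boolean): Iterator[T] => Iterator[T] = _.filter(p)
  def glomFn[T: ClassTag]: Iterator[T] => Iterator[Array[T]] =
    it => Iterator(it.toArray)
}
```

A single `MapPartitionsRDD(prev, fn)` parameterized by such a function therefore subsumes all of them.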
[jira] [Created] (SPARK-4716) Avoid shuffle when all-to-all operation has single input and output partition
Sandy Ryza created SPARK-4716: - Summary: Avoid shuffle when all-to-all operation has single input and output partition Key: SPARK-4716 URL: https://issues.apache.org/jira/browse/SPARK-4716 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza I encountered an application that performs joins on a bunch of small RDDs, unions the results, and then performs larger aggregations across them. Many of these small RDDs fit in a single partition. For these operations with only a single partition, there's no reason to write data to disk and then fetch it over a socket.
[jira] [Commented] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232783#comment-14232783 ] Apache Spark commented on SPARK-4717: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3577 Optimize BLAS library to avoid de-reference multiple times in loop -- Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Have local references to the `values` and `indices` arrays in the `Vector` object so the JVM can locate the values with one operation call. See SPARK-4581 for a similar optimization and the bytecode analysis.
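A sketch of the optimization, with a made-up `SparseLike` class standing in for MLlib's sparse vector. The point is the single field dereference hoisted out of the hot loop:

```scala
// Hypothetical stand-in for MLlib's SparseVector (not the real class).
class SparseLike(val indices: Array[Int], val values: Array[Double])

object Blas {
  // Scale a vector in place. Reading x.values once into a local means the loop
  // body works from a local-variable load instead of repeating the field
  // dereference (a getfield in bytecode) on every iteration.
  def scal(a: Double, x: SparseLike): Unit = {
    val values = x.values // hoisted local reference
    var i = 0
    while (i < values.length) {
      values(i) *= a
      i += 1
    }
  }
}
```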
[jira] [Resolved] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4672. --- Resolution: Fixed Issue resolved by pull request 3545 [https://github.com/apache/spark/pull/3545] Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflowError will reliably occur in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., “degreeGraph => subGraph => degreeGraph => subGraph => …”) {code} # They will iterate many times to converge. An example:
{code:borderStyle=solid}
// K-Core Algorithm
val kNum = 5
var degreeGraph = graph.outerJoinVertices(graph.degrees) {
  (vid, vd, degree) => degree.getOrElse(0)
}.cache()
do {
  val subGraph = degreeGraph.subgraph(
    vpred = (vid, degree) => degree >= kNum
  ).cache()
  val newDegreeGraph = subGraph.degrees
  degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()
  isConverged = check(degreeGraph)
} while (isConverged == false)
{code}
After about 300 iterations, the StackOverflowError will definitely occur, with the following stack trace:
{code:borderStyle=solid}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find out its causes, we detail them in the following 3 paragraphs.

h3. Phase 1: Try using checkpoint() to shorten the lineage

It is easy to suspect that the long lineage is the cause. For some RDDs, their lineages grow with the iterations. Also, for some magical references, their lineage lengths never decrease and finally become very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, and finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) gains 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the finalRDD to the ShuffledRDD) in each stage. To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops back down whenever it reaches a certain length (e.g., 33). However, the StackOverflowError still occurs after 300+ iterations!

h3. Phase 2: An abnormal f closure function leads to an unbreakable serialization chain

After long debugging, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage broken by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage stays as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> partitionsRDD: MapPartitionsRDD -> RDDs in the previous iterations {code} !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%! More description: while an RDD is being serialized, its f function {code:borderStyle=solid} e.g., f: (Iterator[A], Iterator[B]) => Iterator[V] in ZippedPartitionsRDD2 {code} will be serialized too. This action will be very dangerous if the f closure has a member
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Fix Version/s: 1.3.0 Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows in partial aggregation without observing any reduction, the Aggregator should just turn off partial aggregation. This reduces memory usage for high-cardinality aggregations. This ticket is for Spark core. There is another ticket tracking this for SQL.
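A hedged pure-Scala sketch of the idea (class name and thresholds are invented, not Spark's Aggregator): hash-aggregate a sample of rows, and if the map is not shrinking the data enough, fall back to passing rows straight through to the shuffle.

```scala
import scala.collection.mutable

class AdaptiveAggregator[K, V](merge: (V, V) => V,
                               sampleSize: Int = 1000,
                               minReduction: Double = 0.5) {
  private val map = mutable.Map.empty[K, V]
  private var seen = 0
  private var passThrough = false

  // Returns None while the row was absorbed into the hash map, or Some(row)
  // once partial aggregation has been switched off.
  def insert(k: K, v: V): Option[(K, V)] = {
    if (passThrough) return Some(k -> v)
    seen += 1
    map.get(k) match {
      case Some(old) => map(k) = merge(old, v)
      case None      => map(k) = v
    }
    // After the sample, check the reduction factor: if the map holds more than
    // minReduction * seen distinct keys, aggregation isn't paying for itself.
    if (seen == sampleSize && map.size > seen * minReduction) passThrough = true
    None
  }

  def aggregated: Iterator[(K, V)] = map.iterator
}
```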
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232719#comment-14232719 ] Aniket Bhatnagar commented on SPARK-3638: - Yes. You may want to open another JIRA ticket for having Kinesis pre-built packages on the Spark download page.

Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Fix For: 1.1.1, 1.2.0

Followed the instructions at https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict described at http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E
[jira] [Created] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
DB Tsai created SPARK-4717: -- Summary: Optimize BLAS library to avoid de-reference multiple times in loop Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Keep local references to the `values` and `indices` arrays of the `Vector` object so the JVM can locate each value with a single operation instead of dereferencing the object on every loop iteration. See SPARK-4581 for a similar optimization and the bytecode analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
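The optimization described in SPARK-4717 can be sketched as follows. This is an illustrative Scala snippet, not the actual MLlib code; the `SparseVec` class and `dot` function are hypothetical stand-ins. Hoisting the field reads into locals lets the JVM load each array reference once instead of dereferencing the vector object on every iteration:

{code}
// Illustrative sketch (not the actual MLlib code): hoist the field reads
// out of the hot loop so the JVM loads each array reference once.
class SparseVec(val indices: Array[Int], val values: Array[Double])

def dot(x: SparseVec, y: Array[Double]): Double = {
  val xIndices = x.indices // local reference: one field load, reused below
  val xValues = x.values
  var sum = 0.0
  var i = 0
  while (i < xIndices.length) {
    sum += xValues(i) * y(xIndices(i))
    i += 1
  }
  sum
}
{code}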
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232724#comment-14232724 ] SaintBacchus commented on SPARK-4694: - Thanks for the reply, [~vanzin]. The problem is clear: the scheduler backend noticed that the AM had exited, so it called sc.stop to shut down the driver process, but a user thread was still alive, which causes this problem. To fix it, use System.exit(-1) instead of sc.stop, so that the JVM does not wait for all user threads to finish before exiting. Is it acceptable to use System.exit() in Spark code? Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently, when I used YARN HA mode to test HiveThriftServer2, I found a problem: the driver could not exit by itself. To reproduce it, do as follows: 1. use YARN HA mode and set am.maxAttemp = 1 for convenience 2. kill the active resource manager in the cluster. The expected result is that the application simply fails, because maxAttemp was 1. The actual result is that all executors end, but the driver stays alive and never closes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
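The 'process leak' in SPARK-4694 reduces to a minimal sketch (illustrative Scala, not Spark code): a non-daemon user thread keeps the JVM alive after sc.stop() returns, while System.exit() terminates the JVM regardless of such threads.

{code}
// A non-daemon thread like HiveThriftServer2's keeps the JVM alive even
// after the Spark context is stopped.
val userThread = new Thread(new Runnable {
  override def run(): Unit = while (true) Thread.sleep(1000)
})
userThread.setDaemon(false) // the default for threads created from main
userThread.start()

// sc.stop()       // returns, but the JVM then waits on userThread forever
// System.exit(-1) // runs shutdown hooks and terminates the JVM immediately
{code}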
[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232801#comment-14232801 ] Micael Capitão commented on SPARK-3553: --- I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that. 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB of RAM each. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: {code} val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) {code} The purpose of using fileStream instead of textFileStream is to customize the way that Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior; the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these phases endlessly with the same files. Any thoughts on what could be causing this? Regards, Easyb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Assignee: (was: Reynold Xin) Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows during partial aggregation without observing any reduction, the Aggregator should turn off partial aggregation. This reduces memory usage for high-cardinality aggregations. This ticket is for Spark core; there is another ticket tracking the same change for SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
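The heuristic proposed in SPARK-2253 can be sketched as follows; the threshold values and names here are assumptions for illustration, not taken from any eventual implementation:

{code}
// Decide whether to disable partial aggregation after sampling some rows.
// `rowsSeen` is the number of input rows inserted so far; `distinctKeys`
// is the current size of the aggregation hash map.
val sampleThreshold = 100000 // hypothetical number of rows to observe first
val minReduction = 0.5       // hypothetical: require at least 2x reduction

def shouldDisablePartialAgg(rowsSeen: Long, distinctKeys: Long): Boolean =
  rowsSeen >= sampleThreshold &&
    distinctKeys.toDouble / rowsSeen > minReduction
// When this returns true, the Aggregator would stop inserting into the
// hash map and pass rows straight through to the shuffle.
{code}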
[jira] [Commented] (SPARK-4085) Job will fail if a shuffle file that's read locally gets deleted
[ https://issues.apache.org/jira/browse/SPARK-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232839#comment-14232839 ] Apache Spark commented on SPARK-4085: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3579 Job will fail if a shuffle file that's read locally gets deleted Key: SPARK-4085 URL: https://issues.apache.org/jira/browse/SPARK-4085 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Reynold Xin Priority: Critical This commit: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a changed the behavior of fetching local shuffle blocks such that if a shuffle block is not found locally, the shuffle block is no longer marked as failed, and a fetch failed exception is not thrown (this is because the catch block here won't ever be invoked: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-e6e1631fa01e17bf851f49d30d028823R202 because the exception thrown from getLocalFromDisk() doesn't get thrown until next() gets called on the iterator). [~rxin] [~matei] it looks like you guys changed the test for this to catch the new exception that gets thrown (https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-9c2e1918319de967045d04caf813a7d1R93). Was that intentional? Because the new exception is a SparkException and not a FetchFailedException, jobs with missing local shuffle data will now fail, rather than having the map stage get retried. This problem is reproducible with this test case: {code} test("hash shuffle manager recovers when local shuffle files get deleted") { val conf = new SparkConf(false) conf.set("spark.shuffle.manager", "hash") sc = new SparkContext("local", "test", conf) val rdd = sc.parallelize(1 to 10, 2).map((_, 1)).reduceByKey(_ + _) rdd.count() // Delete one of the local shuffle blocks. 
sc.env.blockManager.diskBlockManager.getFile(new ShuffleBlockId(0, 0, 0)).delete() rdd.count() } {code} which will fail on the second rdd.count(). This is a regression from 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
Shixiong Zhu created SPARK-4715: --- Summary: ShuffleMemoryManager.tryToAcquire may return a negative value Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Here is a unit test to demonstrate it: {code} test("threads should not be granted a negative size") { val manager = new ShuffleMemoryManager(1000L) manager.tryToAcquire(700L) val latch = new CountDownLatch(1) startThread("t1") { manager.tryToAcquire(300L) latch.countDown() } latch.await() // Wait until `t1` calls `tryToAcquire` val granted = manager.tryToAcquire(300L) assert(0 === granted, "granted is negative") } {code} It outputs "0 did not equal -200 granted is negative" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
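The negative grant can occur when the amount offered to a thread is computed from the remaining pool and a per-thread cap, which can go below zero once other threads hold most of the memory. A minimal sketch of one possible fix (an assumption for illustration, not necessarily the merged patch) is to clamp the grant at zero:

{code}
// Illustrative only: clamp the computed grant so it is never negative.
// `curMem` is what this thread already holds; `maxPerThread` is its cap.
def grant(requested: Long, freeMemory: Long,
          maxPerThread: Long, curMem: Long): Long = {
  val headroom = math.min(freeMemory, maxPerThread - curMem) // may be < 0
  math.max(0L, math.min(requested, headroom))
}
{code}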
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232845#comment-14232845 ] Lijie Xu commented on SPARK-4672: - Thank you [~ankurdave]. Yes, the StackOverflow error disappears once the PRs are merged and checkpoint() is performed every N iterations. However, if the lineage of an iterative algorithm is too long, we still need to call checkpoint() manually to cut off the lineage and avoid this error. Moreover, the fix suggested by [~jason.dai] is worthwhile, since this is a general problem. We need more elegant methods to avoid the long chain of the task's serialization(). Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflow error reliably occurs in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., degreeGraph => subGraph => degreeGraph => subGraph => ...) {code} # They iterate many times to converge. 
An example: {code:borderStyle=solid} //K-Core Algorithm val kNum = 5 var isConverged = false var degreeGraph = graph.outerJoinVertices(graph.degrees) { (vid, vd, degree) => degree.getOrElse(0) }.cache() do { val subGraph = degreeGraph.subgraph( vpred = (vid, degree) => degree >= kNum ).cache() val newDegreeGraph = subGraph.degrees degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) { (vid, vd, degree) => degree.getOrElse(0) }.cache() isConverged = check(degreeGraph) } while (isConverged == false) {code} After about 300 iterations, a StackOverflow error will definitely occur with the following stack trace: {code:borderStyle=solid} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275) java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) {code} It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find its causes, we detail them in the following 3 paragraphs. h3. Phase 1: Try using checkpoint() to shorten the lineage It is easy to suspect that the long lineage is the cause. For some RDDs, the lineage grows with the iterations. Also, for some magical references, the lineage length never decreases and finally becomes very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, which finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the final RDD to the ShuffledRDD) in each stage. 
To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops down when it reaches a certain length (e.g., 33). However, the StackOverflow error still occurs after 300+ iterations! h3. Phase 2: An abnormal f closure leads to an unbreakable serialization chain After a long debugging session, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage cut made by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage is as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (in Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) ->
[jira] [Created] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
Reynold Xin created SPARK-4719: -- Summary: Consolidate various narrow dep RDD classes with MapPartitionsRDD Key: SPARK-4719 URL: https://issues.apache.org/jira/browse/SPARK-4719 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4710. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3568 [https://github.com/apache/spark/pull/3568] Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4710: - Assignee: Joseph K. Bradley Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample
[ https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4708: - Assignee: DB Tsai Make k-mean runs two/three times faster with dense/sparse sample Key: SPARK-4708 URL: https://issues.apache.org/jira/browse/SPARK-4708 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.2.0 Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and breezeSquaredDistance is slow. We should replace it with our own implementation. Here is the benchmark against mnist8m dataset. Before DenseVector: 70.04secs SparseVector: 59.05secs With this PR DenseVector: 30.58secs SparseVector: 21.14secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
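One way to replace the slow breezeSquaredDistance is the norm-based approach sketched below; whether this matches the committed implementation is an assumption. With precomputed norms, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b), so each distance needs only a single dot product instead of a full element-wise subtraction pass. Note that this form can lose precision when the two vectors are nearly equal:

{code}
// Illustrative dense-vector sketch, not the actual MLlib code.
def fastSquaredDistance(a: Array[Double], normA: Double,
                        b: Array[Double], normB: Double): Double = {
  var dot = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); i += 1 }
  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clamped against rounding error
  math.max(0.0, normA * normA + normB * normB - 2.0 * dot)
}
{code}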
[jira] [Resolved] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample
[ https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4708. -- Resolution: Implemented Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Make k-mean runs two/three times faster with dense/sparse sample Key: SPARK-4708 URL: https://issues.apache.org/jira/browse/SPARK-4708 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.2.0 Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and breezeSquaredDistance is slow. We should replace it with our own implementation. Here is the benchmark against mnist8m dataset. Before DenseVector: 70.04secs SparseVector: 59.05secs With this PR DenseVector: 30.58secs SparseVector: 21.14secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4721) Improve first thread to put block failed
SuYan created SPARK-4721: Summary: Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockId into the BlockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This works when the put succeeds, but when it fails there are problems: 1. the failed thread removes the info from blockinfos; 2. the other threads wake up and use the old info.synchronized to retry the put; 3. if a retry succeeds, marking it as success fails, because the info is no longer in pending status. All remaining threads then do the same thing: acquire info.synchronized and mark success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says it is so other threads can create a new block info, but a block info is just an id and a storage level, so using the old one instead of a new one makes no difference when there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, although fewer than all of them need to; alternatively, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232889#comment-14232889 ] Apache Spark commented on SPARK-4721: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/3582 Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockId into the BlockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This works when the put succeeds, but when it fails there are problems: 1. the failed thread removes the info from blockinfos; 2. the other threads wake up and use the old info.synchronized to retry the put; 3. if a retry succeeds, marking it as success fails, because the info is no longer in pending status. All remaining threads then do the same thing: acquire info.synchronized and mark success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says it is so other threads can create a new block info, but a block info is just an id and a storage level, so using the old one instead of a new one makes no difference when there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, although fewer than all of them need to; alternatively, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
Arthur Andres created SPARK-4722: Summary: StreamingLinearRegression should return a DStream of weights when calling trainOn Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
[ https://issues.apache.org/jira/browse/SPARK-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232920#comment-14232920 ] Xiangrui Meng commented on SPARK-4722: -- [~Arthur] You can use `StreamingLinearRegression.model` to get the latest model. It may be expensive and unnecessary to make trainOn return a DStream of model weights. If you want to re-use the previously trained model, you can save the last model's coefficients in the first run and then set them as the initial weights in the second run. StreamingLinearRegression should return a DStream of weights when calling trainOn - Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor Labels: mllib, regression, streaming When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
[ https://issues.apache.org/jira/browse/SPARK-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232929#comment-14232929 ] Arthur Andres commented on SPARK-4722: -- I understand your point about it being heavy, but I still believe it would be useful. Maybe a function could be added that creates and returns a stream of weights? As far as the model() function is concerned, is it safe/possible to access it after the streaming has started? Thanks. StreamingLinearRegression should return a DStream of weights when calling trainOn - Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor Labels: mllib, regression, streaming When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4723) To abort the stages which have attempted some times
YanTang Zhai created SPARK-4723: --- Summary: To abort the stages which have attempted some times Key: SPARK-4723 URL: https://issues.apache.org/jira/browse/SPARK-4723 Project: Spark Issue Type: Improvement Reporter: YanTang Zhai Priority: Minor For some reason, some stages may be attempted many times. A threshold could be added, and stages that have been attempted more times than the threshold could be aborted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233007#comment-14233007 ] Sean Owen commented on SPARK-4710: -- (PS, I have a JIRA/PR to clean up most all of the current build warnings: http://issues.apache.org/jira/browse/SPARK-4297 ) Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
Emre Sevinç created SPARK-4724: -- Summary: JavaNetworkWordCount.java has a wrong import Key: SPARK-4724 URL: https://issues.apache.org/jira/browse/SPARK-4724 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.0 Reporter: Emre Sevinç Priority: Minor JavaNetworkWordCount.java has a wrong import. [Line 28|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L28] in [spark/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java] reads as {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} import org.apache.spark.streaming.Durations; {code} But according to the [documentation|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/package-summary.html], it should be [Duration|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Duration.html] (or [Seconds|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Seconds.html]), not **Durations** . Also [line 60|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L60] should change accordingly because currently it reads as: {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233031#comment-14233031 ] Yana Kadiyska commented on SPARK-4702: -- I'm investigating the possibility that this is caused by Hive being switched to Hive0.13 by default. Building with 0.12 profile now, will close if this goes away Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4725) Re-think custom shuffle serializers for vertex messages
Takeshi Yamamuro created SPARK-4725: --- Summary: Re-think custom shuffle serializers for vertex messages Key: SPARK-4725 URL: https://issues.apache.org/jira/browse/SPARK-4725 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor These serializers were removed in SPARK-3649 because type mismatch errors occurred in SortShuffleWriter. https://www.mail-archive.com/commits@spark.apache.org/msg04125.html However, serialization of messages between executors can be a critical performance issue in PageRank and other communication-intensive graph tasks. In the commit log, Ankur reported that the removal caused a slowdown and an increase in per-iteration communication. I made a patch to avoid the type-mismatch error in https://github.com/maropu/spark/commit/20e74f0e41ed99cb0a89ec5e5fc0e3c9e3f1038e#diff-68f4d319d5a58cbe0729476e0cb8594aR39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233034#comment-14233034 ] Sean Owen commented on SPARK-4724: -- No, this is correct. {{Durations}} is an object, with helper methods that create a {{Duration}}. They both exist. JavaNetworkWordCount.java has a wrong import Key: SPARK-4724 URL: https://issues.apache.org/jira/browse/SPARK-4724 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.0 Reporter: Emre Sevinç Priority: Minor Labels: examples JavaNetworkWordCount.java has a wrong import. [Line 28|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L28] in [spark/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java] reads as {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} import org.apache.spark.streaming.Durations; {code} But according to the [documentation|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/package-summary.html], it should be [Duration|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Duration.html] (or [Seconds|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Seconds.html]), not **Durations** . 
Also [line 60|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L60] should change accordingly because currently it reads as: {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
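The relationship Sean describes below ({{Durations}} is a factory object whose static helpers build {{Duration}} values) can be sketched with stand-in classes. This is an illustrative analogue only: the class names mirror the Spark API, but the bodies are hypothetical, not Spark's actual implementation.

```java
public class FactoryDemo {
    // Stand-in for org.apache.spark.streaming.Duration: a simple value class.
    static final class Duration {
        final long millis;
        Duration(long millis) { this.millis = millis; }
    }

    // Stand-in for org.apache.spark.streaming.Durations: a companion-style
    // factory whose static helpers construct Duration values.
    static final class Durations {
        static Duration seconds(long n) { return new Duration(n * 1000L); }
    }

    public static void main(String[] args) {
        // Both names exist and cooperate: Durations.seconds(1) yields a Duration.
        Duration d = Durations.seconds(1);
        System.out.println(d.millis); // prints 1000
    }
}
```

The same pattern explains why the example imports {{Durations}} (the factory) even though the batch-interval parameter type is {{Duration}}.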
[jira] [Resolved] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4717. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3577 [https://github.com/apache/spark/pull/3577] Optimize BLAS library to avoid de-reference multiple times in loop -- Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Fix For: 1.2.0 Have a local reference to the `values` and `indices` arrays in the `Vector` object so the JVM can locate the value with one operation call. See `SPARK-4581` for a similar optimization and the bytecode analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
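The optimization described can be sketched outside of MLlib. `Vec` and its two methods are hypothetical stand-ins for the access pattern, not Spark code: hoisting an instance field into a local variable removes the repeated field load before each array access, letting the JIT keep the array reference in a register.

```java
public class LocalRefDemo {
    // Hypothetical stand-in for a Vector that holds a values array.
    static class Vec {
        double[] values;
        Vec(double[] v) { values = v; }

        // Each iteration re-loads the `values` field before indexing into it.
        double sumViaField() {
            double s = 0.0;
            for (int i = 0; i < values.length; i++) s += values[i];
            return s;
        }

        // Hoisting the field into a local reference performs the de-reference
        // once, outside the loop.
        double sumViaLocal() {
            double[] v = values;
            double s = 0.0;
            for (int i = 0; i < v.length; i++) s += v[i];
            return s;
        }
    }

    public static void main(String[] args) {
        Vec vec = new Vec(new double[]{1.0, 2.0, 3.0});
        System.out.println(vec.sumViaField()); // prints 6.0
        System.out.println(vec.sumViaLocal()); // prints 6.0
    }
}
```

Both methods compute the same result; the difference shows up in the generated bytecode (one `getfield` per loop iteration versus one total), which is the kind of analysis SPARK-4581 refers to.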
[jira] [Updated] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4717: - Assignee: DB Tsai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233053#comment-14233053 ] Emre Sevinç commented on SPARK-4724: Then how do I import {{org.apache.spark.streaming.Durations}} and then use it as {{JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));}} ? Because I can't import and compile it as above (I receive {{cannot find symbol}} errors when I try to), whereas I can import {{org.apache.spark.streaming.Duration}} or {{org.apache.spark.streaming.Seconds}} and then use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233067#comment-14233067 ] Xiangrui Meng commented on SPARK-4001: -- [~jackylk] Could you share some performance testing results? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233071#comment-14233071 ] Xiangrui Meng commented on SPARK-4710: -- [~srowen] Sorry I didn't see your PR. Since Joseph's was merged, do you mind updating yours? Thanks! Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233075#comment-14233075 ] Sean Owen commented on SPARK-4710: -- [~mengxr] It looks like it didn't overlap -- maybe these were newer than my PR. This is the PR I'm suggesting at the moment, which still looks mergeable: https://github.com/apache/spark/pull/3157 Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233078#comment-14233078 ] Sean Owen commented on SPARK-4724: -- I believe the class is new in 1.2. You are looking at the example code as of master, but the class documentation as of 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska commented on SPARK-4702: -- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from the master branch from October 24th where this scenario works. We are running CDH4.6 / Hive 0.10. In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233100#comment-14233100 ] Emre Sevinç commented on SPARK-4724: OK, now I see. What confused me was the word _latest_ in the URL of the Javadoc: https://spark.apache.org/docs/latest/api/java/index.html. I failed to realize that the title of the HTML page indicates **Spark 1.1.1 JavaDoc**, and that _latest_ referred to the latest release of the 1.1.x series, and not the 1.2.x development branch (master) as of now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4726) NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
Dmitriy Makarenko created SPARK-4726: Summary: NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions Key: SPARK-4726 URL: https://issues.apache.org/jira/browse/SPARK-4726 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Reporter: Dmitriy Makarenko I get this stacktrace that doesn't contain any of my functions - Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:714) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:698) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1198) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) As far as I know, SystemDefaultHttpClient is used inside the SolrJ library that I use, but it is in a separate jar from my project. All of my classes are Serializable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
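The failure mode can be reproduced in miniature with plain Java serialization, which is what Spark's closure serializer ultimately hits. `FakeHttpClient` is a hypothetical stand-in for SystemDefaultHttpClient; the `transient` variant sketches one common workaround (another is to construct the client inside `mapPartitions`/`foreachPartition` so it never enters the serialized closure).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // Stand-in for SystemDefaultHttpClient: a type that is NOT Serializable.
    static class FakeHttpClient {}

    // A task-like class that captures the client directly: serializing it fails,
    // even though the class itself implements Serializable.
    static class BadTask implements Serializable {
        FakeHttpClient client = new FakeHttpClient();
    }

    // Marking the field transient keeps the client out of the serialized form,
    // so serialization succeeds (the field must then be re-created after
    // deserialization, e.g. lazily per partition).
    static class GoodTask implements Serializable {
        transient FakeHttpClient client = new FakeHttpClient();
    }

    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bad=" + serializes(new BadTask()));
        System.out.println("good=" + serializes(new GoodTask()));
    }
}
```

This also explains why the stack trace shows only scheduler frames: the exception is raised when the DAGScheduler serializes the task closure, not while user code is running.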
[jira] [Resolved] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emre Sevinç resolved SPARK-4724. Resolution: Not a Problem Not a Problem. It was a misunderstanding on my side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emre Sevinç closed SPARK-4724. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233107#comment-14233107 ] Travis Galoppo commented on SPARK-4156: --- I have modified the cluster initialization strategy to derive an initial covariance matrix from the sample points used to initialize the clusters; this initial covariance matrix has the element-wise variance of the sample points on the diagonal. The final computed covariance matrix is not constrained to be diagonal. I tested this with the S1 dataset [~MeethuMathew] referenced above; while it does fix the problem of effectively finding no clusters, I find that the results are still better when the input is scaled as I mentioned above. It might be worthwhile to allow the user to provide a pre-initialized model to accommodate various initialization strategies, and provide the current functionality as a default. Thoughts? Also, I have fixed the defect in DenseGmmEM whereby it was ignoring the delta parameter. Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
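The element-wise-variance initialization described above can be sketched in plain Java on double arrays. This is a hypothetical stand-in for the MLlib code, showing only the diagonal of the initial covariance matrix (off-diagonal entries start at zero; the final fitted covariance is not constrained to stay diagonal).

```java
public class DiagCovDemo {
    // Element-wise (per-dimension) variance of the sample points: these values
    // form the diagonal of an initial covariance matrix for a cluster.
    static double[] diagonalVariance(double[][] points) {
        int d = points[0].length;
        double[] mean = new double[d];
        double[] var = new double[d];
        for (double[] p : points)
            for (int j = 0; j < d; j++) mean[j] += p[j] / points.length;
        for (double[] p : points)
            for (int j = 0; j < d; j++) {
                double diff = p[j] - mean[j];
                var[j] += diff * diff / points.length;
            }
        return var;
    }

    public static void main(String[] args) {
        // Two 2-D sample points: variance is 1.0 in dim 0 and 0.0 in dim 1.
        double[][] pts = {{1.0, 10.0}, {3.0, 10.0}};
        double[] v = diagonalVariance(pts);
        System.out.println(v[0] + " " + v[1]); // prints 1.0 0.0
    }
}
```

The zero-variance dimension in this toy example also illustrates why scaling the input still matters: dimensions with wildly different variances give poorly conditioned starting covariances.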
[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233124#comment-14233124 ] Sean Owen commented on SPARK-4690: -- No, it is using quadratic probing. It adds {{delta}} each time, but {{delta}} grows by 1 each time too. The offsets are thus 1, 3, 6, 10 ... or the sequence n*(n+1)/2, which is quadratic. AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision is just linear probing. The code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
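Sean's point can be checked directly. Incrementing {{delta}} by 1 on each probe makes the cumulative offsets the triangular numbers n*(n+1)/2, and on a power-of-2 table that sequence visits every slot. The following is a small standalone check of that probing scheme, not Spark code:

```java
import java.util.HashSet;
import java.util.Set;

public class ProbeDemo {
    public static void main(String[] args) {
        int size = 16;            // power-of-2 table size, as AppendOnlyMap requires
        int mask = size - 1;
        int pos = 0;              // starting hash position
        int delta = 1;            // grows by 1 each probe, as in the snippet above
        Set<Integer> visited = new HashSet<>();
        for (int i = 0; i < size; i++) {
            visited.add(pos);
            // AppendOnlyMap-style step: advance by delta, wrap with the mask.
            // Cumulative offsets are 1, 3, 6, 10, ... (quadratic, not linear).
            pos = (pos + delta) & mask;
            delta += 1;
        }
        // Triangular-number probing reaches all 16 slots within 16 probes.
        System.out.println("slots visited: " + visited.size());
    }
}
```

If the step were a constant (true linear probing), the same loop would also cover the table, but the quadratic growth of the offsets is what breaks up clusters of colliding keys.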
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 3:53 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from the master branch from October 24th where this scenario works. We are running CDH4.6 / Hive 0.10. Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient -- not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SPARK-4727) Add dimensional RDDs (time series, spatial)
RJ Nowling created SPARK-4727: - Summary: Add dimensional RDDs (time series, spatial) Key: SPARK-4727 URL: https://issues.apache.org/jira/browse/SPARK-4727 Project: Spark Issue Type: Brainstorming Components: Spark Core Affects Versions: 1.1.0 Reporter: RJ Nowling Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this. For example, time series data should be ordered by time and would benefit from operations like: * Subsampling (taking every n data points) * Signal processing (correlations, FFTs, filtering) * Windowing functions Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid. Although the operations on time series and spatial data may be different, there is some commonality in that the data has ordered dimensions, and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
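The grid-partitioning idea above can be sketched without Spark: map each point to an integer cell key so that points within a set distance fall in the same or a neighboring cell, and only those cells need to be compared. A minimal illustration of the key computation (the cell size and key packing here are assumptions for the sketch, not part of the proposal):

```java
public class GridKeys {
    // Map a 2D point to a grid cell key for a given cell size. Points
    // within `cellSize` of each other land in the same or an adjacent
    // cell, so a distance-bounded search only scans 9 cells.
    static long cellKey(double x, double y, double cellSize) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL); // pack both cell indices into one key
    }

    public static void main(String[] args) {
        // Two close points share a cell; a far point does not.
        System.out.println(cellKey(1.0, 1.0, 10.0) == cellKey(2.0, 3.0, 10.0));   // true
        System.out.println(cellKey(1.0, 1.0, 10.0) == cellKey(50.0, 50.0, 10.0)); // false
    }
}
```

In a hypothetical dimensional RDD, such a key would serve as the partitioning key, so co-located points end up in the same partition.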
[jira] [Created] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
RJ Nowling created SPARK-4728: - Summary: Add exponential, log normal, and gamma distributions to data generator to MLlib Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
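The proposal is to use commons-math3 samplers like the existing generators; for intuition, two of the requested distributions can also be derived from a uniform or Gaussian source. A hedged JDK-only sketch (not the MLlib or math3 implementation):

```java
import java.util.Random;

public class Distributions {
    // Exponential(rate) via inverse-transform sampling: -ln(1 - U) / rate.
    static double nextExponential(Random rng, double rate) {
        return -Math.log(1.0 - rng.nextDouble()) / rate;
    }

    // LogNormal(mu, sigma): exponentiate a Gaussian sample.
    static double nextLogNormal(Random rng, double mu, double sigma) {
        return Math.exp(mu + sigma * rng.nextGaussian());
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 1000; i++) {
            // Both distributions have non-negative support.
            if (nextExponential(rng, 2.0) < 0) throw new AssertionError();
            if (nextLogNormal(rng, 0.0, 1.0) <= 0) throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```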
[jira] [Created] (SPARK-4729) Add time series subsampling to MLlib
RJ Nowling created SPARK-4729: - Summary: Add time series subsampling to MLlib Key: SPARK-4729 URL: https://issues.apache.org/jira/browse/SPARK-4729 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports several time series functions. The ability to subsample a time series (take every n data points) is missing. I'd like to add it, so please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
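"Take every n data points" is just an index filter; inside Spark it could plausibly be expressed with zipWithIndex plus a filter on the index. A minimal plain-Java sketch of the intended semantics (names are illustrative, not the proposed MLlib API):

```java
import java.util.ArrayList;
import java.util.List;

public class Subsample {
    // Keep every n-th element (indices 0, n, 2n, ...), mirroring what a
    // zipWithIndex + filter(index % n == 0) pipeline would do on an RDD.
    static <T> List<T> everyNth(List<T> data, int n) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < data.size(); i += n) {
            out.add(data.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(everyNth(xs, 3)); // [0, 3, 6, 9]
    }
}
```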
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 6:09 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; was (Author: yanakad): Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? 
In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at
[jira] [Assigned] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-4552: --- Assignee: Michael Armbrust query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4552: Priority: Blocker (was: Major) query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233314#comment-14233314 ] Michael Armbrust commented on SPARK-4552: - It turns out this also manifests when writing a query that doesn't match any partitions, which seems much more common. I'm going to do a quick surgical fix given the proximity to the release. query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4702. - Resolution: Duplicate Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233321#comment-14233321 ] Marcelo Vanzin commented on SPARK-4694: --- To answer your question, you can call System.exit() if you want. It's just recommended that it's done after you properly shut down the SparkContext, otherwise Yarn won't report your app status correctly. But it seems your patch doesn't use System.exit(), so this is kinda moot. Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently when I used Yarn HA mode to test the HiveThriftServer2, I found a problem: the driver can't exit by itself. To reproduce it, do the following: 1. use yarn HA mode and set am.maxAttempts = 1 for convenience 2. kill the active resource manager in the cluster The expected result is that the application simply fails, because maxAttempts was 1. But the actual result is that all executors ended while the driver was still there and never closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1429#comment-1429 ] Matei Zaharia commented on SPARK-4690: -- Yup, that's the definition of it. AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision appears to use linear probing; see the code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
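For context on why this is quadratic probing: incrementing the step on each probe makes the cumulative offset the triangular numbers 1, 3, 6, 10, ... — a quadratic function of the probe count — and for a power-of-2 table size this sequence visits every slot. A small standalone check (hypothetical demo code, not the AppendOnlyMap source):

```java
import java.util.HashSet;
import java.util.Set;

public class TriangularProbing {
    // Probe sequence from the report: delta grows by 1 each step, so the
    // i-th probe lands at start + i*(i+1)/2 (mod capacity). Returns true
    // if the first `capacity` probes visit every slot of a power-of-2 table.
    static boolean visitsAllSlots(int capacity) {
        int mask = capacity - 1; // capacity must be a power of 2
        Set<Integer> seen = new HashSet<>();
        int pos = 0;
        for (int i = 1; i <= capacity; i++) {
            seen.add(pos);
            pos = (pos + i) & mask; // the "linear-looking" step with a growing delta
        }
        return seen.size() == capacity;
    }

    public static void main(String[] args) {
        System.out.println(visitsAllSlots(8));  // true
        System.out.println(visitsAllSlots(64)); // true
    }
}
```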
[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-4690. Resolution: Invalid AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision appears to use linear probing; see the code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233344#comment-14233344 ] Michael Armbrust commented on SPARK-4702: - Thanks for reporting. As a workaround you should be able to SET spark.sql.hive.convertMetastoreParquet=false, but I'm going to try to fix this before the next RC. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233381#comment-14233381 ] Andrew Or commented on SPARK-4697: -- Hey, did you search for a duplicate or related JIRA in the history? I'm pretty sure there's another one that suggests something similar. System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic I found that some arguments in the yarn module read environment variables before system properties, while in the core module the latter override the former. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
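The precedence the reporter asks for amounts to checking the JVM system property first and only falling back to the environment variable. A hedged sketch of that lookup order (the setting names here are illustrative, not actual Spark configuration keys):

```java
public class ConfigPrecedence {
    // Core-module-style precedence: a system property wins over the
    // environment variable for the same setting; default comes last.
    static String resolve(String propKey, String envKey, String defaultValue) {
        String prop = System.getProperty(propKey);
        if (prop != null) return prop;
        String env = System.getenv(envKey);
        return env != null ? env : defaultValue;
    }

    public static void main(String[] args) {
        System.setProperty("example.queue", "prop-queue");
        // With the property set, any EXAMPLE_QUEUE env var is ignored.
        System.out.println(resolve("example.queue", "EXAMPLE_QUEUE", "default")); // prop-queue
    }
}
```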
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233385#comment-14233385 ] Sandy Ryza commented on SPARK-4687: --- [~pwendell], do you think this is a reasonable API addition? If so, I'll try and add it. SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Reporter: Jimmy Xiang Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233393#comment-14233393 ] Apache Spark commented on SPARK-4552: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3586 query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233443#comment-14233443 ] Yana Kadiyska commented on SPARK-4702: -- Michael, just wanted to point out that the workaround you suggested did indeed help when the partition is missing -- I get a count of 0. But it did break the otherwise working case when a partition is present: java.lang.IllegalStateException: All the offsets listed in the split should be found in the file. expected: [4, 4] found: {my schema dumped out here} out of: [4, 121017555, 242333553, 363518600] in range 0, 134217728 It's possible that this is a very corner case -- we've added columns to our schema so it's possible that the parquet files are likely not symmetric (not quite sure what convertMetastoreParquet does under the hood). But wanted to point out that in our case the bug is truly a blocker (I'm hoping it makes it in 1.2, don't care if it makes it in the next RC or later) Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-3926: -- I am reopening as it is not actually serializable without a no-arg constructor. PR coming shortly that copies the implementation from Scala's Wrappers.MapWrapper. result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Assignee: Sean Owen Fix For: 1.1.1, 1.2.0 Using the Java API, I want to collect the result of an RDD<String, String> as a HashMap using the collectAsMap function: Map<String, String> map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non-serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) => wrapped case JMapWrapper(wrapped) => wrapped.asInstanceOf[ju.Map[A, B]] case _ => new MapWrapper(m) } ../.. 
The workaround is to manually copy this map into another one (which is serializable): Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp))
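The reporter's workaround can be checked outside Spark. The sketch below is a hypothetical illustration in plain Java: the `serialize` helper stands in for Spark's JavaSerializer, and the `HashMap` stands in for the map returned by `collectAsMap()` (which in Spark 1.0-1.2 is actually a non-serializable `Wrappers$MapWrapper`). It shows that copying into a `java.util.HashMap` yields something `ObjectOutputStream` can serialize:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CollectAsMapWorkaround {
    // Hypothetical helper: serialize an object the way Spark's JavaSerializer
    // would; throws NotSerializableException for non-Serializable types.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the result of myJavaRDD.collectAsMap().
        Map<String, String> collected = new HashMap<>();
        collected.put("key", "value");

        // The workaround: copy into a plain HashMap, which implements
        // Serializable, before capturing it in a function shipped to executors.
        Map<String, String> tmp = new HashMap<>(collected);
        System.out.println(serialize(tmp).length > 0); // prints "true"
    }
}
```

The copy works because `HashMap` implements `Serializable` regardless of where its entries came from; Sean Owen's fix makes the wrapper itself serializable so the copy is no longer needed.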
[jira] [Commented] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233453#comment-14233453 ] Apache Spark commented on SPARK-3926: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3587 result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Assignee: Sean Owen Fix For: 1.1.1, 1.2.0 Using the Java API, I want to collect the result of an RDD<String, String> as a HashMap using the collectAsMap function: Map<String, String> map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non-serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) => wrapped case JMapWrapper(wrapped) => wrapped.asInstanceOf[ju.Map[A, B]] case _ => new MapWrapper(m) } ../.. 
The workaround is to manually copy this map into another one (which is serializable): Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp))
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233457#comment-14233457 ] Michael Armbrust commented on SPARK-4702: - Yana, I'm a little confused. Were both of these cases working in 1.1? As far as I understand, setting spark.sql.hive.convertMetastoreParquet=false should restore the behavior as of 1.1. When convertMetastoreParquet is true, there is not currently support for heterogeneous schema. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Closed] (SPARK-4701) Typo in sbt/sbt
[ https://issues.apache.org/jira/browse/SPARK-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4701. Resolution: Fixed Fix Version/s: 1.1.2 1.2.0 Assignee: Masayoshi TSUZUKI Target Version/s: 1.1.1, 1.2.0 Typo in sbt/sbt --- Key: SPARK-4701 URL: https://issues.apache.org/jira/browse/SPARK-4701 Project: Spark Issue Type: Bug Components: Build Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Trivial Fix For: 1.2.0, 1.1.2 in sbt/sbt {noformat} -S-X add -X to sbt's scalacOptions (-J is stripped) {noformat} but {{(-S is stripped)}} is correct.
[jira] [Updated] (SPARK-4691) code optimization for judgement
[ https://issues.apache.org/jira/browse/SPARK-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4691: - Affects Version/s: 1.1.0 code optimization for judgement --- Key: SPARK-4691 URL: https://issues.apache.org/jira/browse/SPARK-4691 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: maji2014 Priority: Minor aggregator and mapSideCombine judgement in HashShuffleWriter.scala SortShuffleWriter.scala HashShuffleReader.scala
[jira] [Resolved] (SPARK-2143) Display Spark version on Driver web page
[ https://issues.apache.org/jira/browse/SPARK-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2143. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (This went in for branch 1.2 but not for the 1.1 branch) Display Spark version on Driver web page Key: SPARK-2143 URL: https://issues.apache.org/jira/browse/SPARK-2143 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Jeff Hammerbacher Priority: Critical Fix For: 1.2.0
[jira] [Updated] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4715: - Affects Version/s: 1.1.0 Assignee: Shixiong Zhu ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs "0 did not equal -200 granted is negative".
[jira] [Closed] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4715. Resolution: Fixed Fix Version/s: 1.1.2 1.2.0 Target Version/s: 1.2.0, 1.1.2 ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.2.0, 1.1.2 Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs "0 did not equal -200 granted is negative".
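The -200 in the test output falls directly out of the fair-share arithmetic. The sketch below is a simplified model of the grant computation, not Spark's actual ShuffleMemoryManager code: with maxMemory = 1000, once a second thread becomes active the first thread's fair share drops to 1000 / 2 = 500, and 500 minus the 700 bytes it already holds is handed back as a negative grant unless the result is clamped at zero:

```java
public class NegativeGrantSketch {
    public static void main(String[] args) {
        long maxMemory = 1000L;   // total shuffle memory, as in the unit test
        long curMem = 700L;       // already acquired by the calling thread
        int numActiveThreads = 2; // t1 has also called tryToAcquire
        long request = 300L;

        // Simplified grant computation: each active thread is capped at an
        // equal share of maxMemory. When curMem exceeds that share, the
        // difference is negative and min() passes it through.
        long buggy = Math.min(request, maxMemory / numActiveThreads - curMem);

        // The fix is to clamp the grant at zero.
        long fixed = Math.max(0L, buggy);

        System.out.println(buggy + " " + fixed); // prints "-200 0"
    }
}
```

This reproduces the -200 from the test output and shows why clamping is enough: a thread over its fair share simply receives nothing until other threads release memory.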
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233499#comment-14233499 ] Michael Armbrust commented on SPARK-4702: - Also, have you tested with: https://github.com/apache/spark/pull/3586? Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233514#comment-14233514 ] Yana Kadiyska commented on SPARK-4702: -- Michael, I do not have a 1.1. In October I built master manually, I believe from commit d2987e8f7a2cb3bf971f381399d8efdccb51d3d2. At that time both types of queries worked, without setting spark.sql.hive.convertMetastoreParquet=false (I tested on a smaller cluster, will drop on the same one now to make sure there's not some data weirdness). If you meant to say "When convertMetastoreParquet is FALSE, there is not currently support for heterogeneous schema" then we are saying the same thing -- I didn't have to set this flag as the missing partitions were handled fine. Now missing partitions are broken, but setting SET spark.sql.hive.convertMetastoreParquet=false breaks the 99% case because my files have different # columns. I have not tried the PR you mentioned, will try it now -- in my case the issue is not an empty file, it's a missing directory -- our query does a partition=-mm query, where parquet files are laid out under -mm directories representing partitions. But will see if the issue is helped by that PR. In any case, I am just hoping this works before the final release, no particular rush Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Commented] (SPARK-4575) Documentation for the pipeline features
[ https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233548#comment-14233548 ] Apache Spark commented on SPARK-4575: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/3588 Documentation for the pipeline features --- Key: SPARK-4575 URL: https://issues.apache.org/jira/browse/SPARK-4575 Project: Spark Issue Type: Improvement Components: Documentation, ML, MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Add user guide for the newly added ML pipeline feature.
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233568#comment-14233568 ] Nicholas Chammas commented on SPARK-3431: - [~joshrosen] I tried [that patch you posted earlier here|https://issues.apache.org/jira/browse/SPARK-3431?focusedCommentId=14168038page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14168038]. It appears to fork a JVM for every individual test (e.g. {{org.apache.spark.streaming.DurationSuite}}). When I tried it out on Jenkins, the tests [timed out after 2 hours|https://github.com/apache/spark/pull/3564#issuecomment-65349149]. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run.
[jira] [Created] (SPARK-4730) Warn against deprecated YARN settings
Andrew Or created SPARK-4730: Summary: Warn against deprecated YARN settings Key: SPARK-4730 URL: https://issues.apache.org/jira/browse/SPARK-4730 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or YARN currently reads from SPARK_MASTER_MEMORY and SPARK_WORKER_MEMORY. If you have these settings left over from a standalone cluster setup and you try to run Spark on YARN on the same cluster, then your executors suddenly get the amount of memory specified through SPARK_WORKER_MEMORY. This behavior is due in large part to backward compatibility. However, we should log a warning against the use of these variables at the very least.
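A warning of the kind proposed could be as simple as checking the submission environment for the standalone-mode variables. The sketch below is a hypothetical illustration, not Spark's actual implementation: the variable names come from the report, while `findDeprecated` and the warning text are assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeprecatedEnvWarning {
    // Hypothetical helper: collect standalone-mode variables that are set
    // in the environment of a YARN submission.
    static List<String> findDeprecated(Map<String, String> env) {
        List<String> hits = new ArrayList<>();
        for (String name : new String[] {"SPARK_MASTER_MEMORY", "SPARK_WORKER_MEMORY"}) {
            if (env.containsKey(name)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Simulated environment left over from a standalone cluster setup.
        Map<String, String> env = new HashMap<>();
        env.put("SPARK_WORKER_MEMORY", "8g");

        for (String name : findDeprecated(env)) {
            System.out.println("WARN: " + name + " is a standalone-mode setting; "
                + "it is read on YARN only for backward compatibility.");
        }
    }
}
```

In a real implementation the check would run once at submission time, so the operator sees the warning before executors launch with the unexpected memory size.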
[jira] [Commented] (SPARK-4730) Warn against deprecated YARN settings
[ https://issues.apache.org/jira/browse/SPARK-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233596#comment-14233596 ] Apache Spark commented on SPARK-4730: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3590 Warn against deprecated YARN settings - Key: SPARK-4730 URL: https://issues.apache.org/jira/browse/SPARK-4730 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or YARN currently reads from SPARK_MASTER_MEMORY and SPARK_WORKER_MEMORY. If you have these settings left over from a standalone cluster setup and you try to run Spark on YARN on the same cluster, then your executors suddenly get the amount of memory specified through SPARK_WORKER_MEMORY. This behavior is due in large part to backward compatibility. However, we should log a warning against the use of these variables at the very least.
[jira] [Resolved] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4552. - Resolution: Fixed Issue resolved by pull request 3586 [https://github.com/apache/spark/pull/3586] query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Run "create table test_parquet(key int, value string) stored as parquet; select * from test_parquet;" and you get an error as follows: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc
[jira] [Created] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
Jey Kottalam created SPARK-4731: --- Summary: Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam EC2 clusters launched using Spark 1.1.1 with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged.
[jira] [Updated] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jey Kottalam updated SPARK-4731: Description: EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. was: EC2 clusters launched using Spark 1.1.1 with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged.
[jira] [Resolved] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4498. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3550 [https://github.com/apache/spark/pull/3550] Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3, java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Fix For: 1.2.0 Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master not detecting that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports applications as "Running", but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of spark configuration options we don't set to the default:
{code}
spark.cores.max: 30
spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs
spark.eventLog.enabled: true
spark.executor.memory: 7g
spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/
spark.io.compression.codec: lzf
spark.ui.killEnabled: true
{code}
- The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main
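One of the factoids above, "in most places, we call sc.stop() explicitly", is worth making unconditional: a driver that exits on an exception path without stopping its context is exactly the kind of process the master can keep listed as Running. A minimal sketch of the driver-side pattern — the `SparkContext` below is a local stand-in stub, not pyspark, so the snippet is self-contained:

```python
class SparkContext:
    """Stand-in stub for pyspark.SparkContext, so the sketch runs anywhere."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = SparkContext()
try:
    pass  # run the job; an exception raised here still reaches the finally block
finally:
    sc.stop()  # always deregister the application from the master

print(sc.stopped)  # True
```

The try/finally guarantees `stop()` runs on every exit path, which narrows the bug down to cases where the driver process is killed outright.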
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233656#comment-14233656 ] Apache Spark commented on SPARK-2188: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3591 Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233670#comment-14233670 ] Masayoshi TSUZUKI commented on SPARK-2188: -- I implemented the equivalent of the sbt scripts for Windows in PowerShell. If you are using Windows, you can just try it. Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users.
[jira] [Created] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
Harry Brundage created SPARK-4732: - Summary: All application progress on the standalone scheduler can be halted by one systematically faulty node Key: SPARK-4732 URL: https://issues.apache.org/jira/browse/SPARK-4732 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Environment: - Spark Standalone scheduler Reporter: Harry Brundage We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. Correct me if I am wrong but when there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. For us, we usually run roughly as many cores as we have hadoop nodes. 
We also usually have many more input splits than we have tasks, which means the locality of the first few tasks (which I believe determines where our executors run) is well spread out over the cluster, and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled on any broken node is quite high. So, in my experience, after an application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard.
I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine. Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
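The worker-level blacklist proposed above can be sketched as a plain-Python simulation (illustrative only — the class name, the failure threshold, and the worker ids are made up here, not Spark APIs):

```python
from collections import defaultdict

class WorkerBlacklist:
    """Track executor failures per worker and exclude systematically
    faulty workers before the per-task failure limit is reached."""

    def __init__(self, max_worker_failures=3):
        self.max_worker_failures = max_worker_failures
        self.failures = defaultdict(int)

    def record_executor_failure(self, worker_id):
        self.failures[worker_id] += 1

    def is_blacklisted(self, worker_id):
        return self.failures[worker_id] >= self.max_worker_failures

    def schedulable(self, workers):
        # Offer resources only from workers that are not yet blacklisted,
        # so one bad node cannot absorb every retry.
        return [w for w in workers if not self.is_blacklisted(w)]

# A worker whose executors always die is excluded after three failures,
# while the rest of the cluster keeps making progress.
bl = WorkerBlacklist(max_worker_failures=3)
for _ in range(3):
    bl.record_executor_failure("dn11")
print(bl.schedulable(["dn11", "dn12", "dn13"]))  # ['dn12', 'dn13']
```

Setting the worker threshold below the task-failure limit is what keeps the application alive: the broken worker is removed from the offer pool before the repeated executor deaths exhaust the task's retry budget.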
[jira] [Updated] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-4732: -- Description: We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. When there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. For us, we usually run roughly as many cores as we have hadoop nodes. We also usually have many more input splits than we have tasks, which means the locality of the first few tasks which I believe determines where our executors run is well spread out over the cluster, and often covers 90-100% of nodes. 
This means the likelihood of any application getting an executor scheduled on any broken node is quite high. So, in my experience, after an application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine.
Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Fix Version/s: (was: 1.2.0) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Archit Thakur Priority: Minor Labels: starter When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117
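The poll-until-dead behavior requested above can be sketched in Python (a sketch under assumptions: `is_alive` stands in for the per-host ssh check, e.g. a `pgrep` for the Worker process; none of these names exist in Spark's scripts):

```python
import time

def wait_until_stopped(workers, is_alive, interval=2.0, timeout=120.0):
    """Poll every `interval` seconds until no worker process is alive,
    or give up after `timeout` seconds. `is_alive(host)` would wrap an
    ssh check such as `ssh host pgrep -f spark.deploy.worker.Worker`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        still_up = [w for w in workers if is_alive(w)]
        if not still_up:
            return True  # every worker has shut down
        time.sleep(interval)
    return False  # timed out with workers still running

# Stubbed check: both workers report alive on the first polling round,
# then "die" before the second, so the wait returns True.
state = {"polls": 0}
def fake_is_alive(host):
    state["polls"] += 1
    return state["polls"] <= 2

print(wait_until_stopped(["w1", "w2"], fake_is_alive, interval=0.01))  # True
```

A `--wait` flag in stop-all.sh would simply run this loop after issuing the stop commands, returning a nonzero exit code on timeout so benchmarking scripts can fail fast.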
[jira] [Updated] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-4732: -- Description: We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. When there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. This happens to us for all applications in the cluster as well. We usually run roughly as many cores as we have hadoop nodes. 
We also usually have many more input splits than we have tasks, which means the locality of the first few tasks (which I believe determines where our executors run) is well spread out over the cluster, and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled on any broken node is quite high. After an old application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard.
I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine. Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
[jira] [Resolved] (SPARK-4085) Job will fail if a shuffle file that's read locally gets deleted
[ https://issues.apache.org/jira/browse/SPARK-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4085. Resolution: Fixed Fix Version/s: 1.2.0 Job will fail if a shuffle file that's read locally gets deleted Key: SPARK-4085 URL: https://issues.apache.org/jira/browse/SPARK-4085 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 This commit: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a changed the behavior of fetching local shuffle blocks such that if a shuffle block is not found locally, the shuffle block is no longer marked as failed, and a fetch failed exception is not thrown (this is because the catch block here won't ever be invoked: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-e6e1631fa01e17bf851f49d30d028823R202 because the exception thrown from getLocalFromDisk() doesn't get thrown until next() gets called on the iterator). [~rxin] [~matei] it looks like you guys changed the test for this to catch the new exception that gets thrown (https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-9c2e1918319de967045d04caf813a7d1R93). Was that intentional? Because the new exception is a SparkException and not a FetchFailedException, jobs with missing local shuffle data will now fail, rather than having the map stage get retried. This problem is reproducible with this test case:
{code}
test("hash shuffle manager recovers when local shuffle files get deleted") {
  val conf = new SparkConf(false)
  conf.set("spark.shuffle.manager", "hash")
  sc = new SparkContext("local", "test", conf)
  val rdd = sc.parallelize(1 to 10, 2).map((_, 1)).reduceByKey(_ + _)
  rdd.count()
  // Delete one of the local shuffle blocks.
  sc.env.blockManager.diskBlockManager.getFile(new ShuffleBlockId(0, 0, 0)).delete()
  rdd.count()
}
{code}
which will fail on the second rdd.count(). This is a regression from 1.1.
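The root cause noted above — the exception from getLocalFromDisk() not surfacing until next() is called on the iterator — is ordinary lazy-iterator behavior, which a few lines of Python (not Spark code; the function and block ids are invented for illustration) make concrete:

```python
def fetch_blocks(block_ids):
    # A generator: constructing it does no work, so a try/except wrapped
    # around the construction catches nothing -- the failure only
    # surfaces once the consumer starts iterating.
    for bid in block_ids:
        if bid == "missing":
            raise IOError("shuffle block not found: %s" % bid)
        yield bid

it = fetch_blocks(["block_0", "missing"])  # no exception raised here
try:
    consumed = list(it)  # iteration triggers the fetch, and the error
except IOError as exc:
    print("caught on consumption:", exc)
```

This is why the catch block around the fetch call is never invoked in the commit above: by the time the error actually fires, control has already moved past the code that was supposed to translate it into a FetchFailedException.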
[jira] [Resolved] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer
[ https://issues.apache.org/jira/browse/SPARK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4711. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3569 [https://github.com/apache/spark/pull/3569] MLlib optimization: docs should suggest how to choose optimizer --- Key: SPARK-4711 URL: https://issues.apache.org/jira/browse/SPARK-4711 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
[jira] [Updated] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer
[ https://issues.apache.org/jira/browse/SPARK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4711: - Assignee: Joseph K. Bradley MLlib optimization: docs should suggest how to choose optimizer --- Key: SPARK-4711 URL: https://issues.apache.org/jira/browse/SPARK-4711 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).