[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227371#comment-14227371 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], Thank you for replying. {quote} I'm having trouble finding the function to cut a dendrogram – I see the tests but not the implementation. {quote} I'm very sorry. [Here|https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L390] is the implementation. I think the function is too specialized to the SciPy dendrogram format. {quote} If clusters at the same levels in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) – can you show this or find a reference for this? If so, then assignment should be faster for a larger number of clusters as Jun was saying above. {quote} Exactly. I agree with you. I implemented the function for assignment as a recursive function in O(log N) time. [https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L482] I compared the assignment performance of my implementation with that of KMeans; the elapsed time of this assignment implementation is slower than that of KMeans. The result is [here|https://issues.apache.org/jira/browse/SPARK-2429?focusedCommentId=14190166page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14190166]. The elapsed time for assignment (predict) is not long to begin with. For example, the assignment time of KMeans in O(N) is 0.011 \[sec\], and that of my implementation in O(log N) is 0.307 \[sec\]. I'm very sorry if I misunderstood your question. thanks, Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
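For readers following the O(log N) assignment discussion above, here is a minimal, hypothetical sketch (not the HierarchicalClustering code linked above; the node structure and names are made up) of assigning a point by descending a cluster tree, choosing the closer child at each level until a leaf is reached:
{code}
// Hypothetical sketch only: descend the cluster tree, always following the
// closer child, so assignment needs O(log N) distance computations for N leaves.
import org.apache.spark.mllib.linalg.Vector

case class ClusterNode(center: Vector, children: Seq[ClusterNode]) {
  def isLeaf: Boolean = children.isEmpty
}

def squaredDistance(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

@scala.annotation.tailrec
def assign(node: ClusterNode, point: Vector): ClusterNode =
  if (node.isLeaf) node
  else assign(node.children.minBy(c => squaredDistance(c.center, point)), point)
{code}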
[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227372#comment-14227372 ] Sean Owen commented on SPARK-4628: -- Here are all the non-Central repos currently used: {code} <url>https://repository.apache.org/content/repositories/releases</url> <url>https://repository.jboss.org/nexus/content/repositories/releases</url> <url>https://repo.eclipse.org/content/repositories/paho-releases</url> <url>https://repository.cloudera.com/artifactory/cloudera-repos</url> <url>http://repository.mapr.com/maven</url> <url>https://repo.spring.io/libs-release</url> <url>https://oss.sonatype.org/content/repositories/orgspark-project-1085</url> <url>https://oss.sonatype.org/content/repositories/orgspark-project-1089/</url> <url>https://repository.apache.org/content/repositories/orgapachespark-1038/</url> {code} Last 3 are temporary. The vendor repos, well, separate question. Might be interesting to do the same exercise with anything else in the secondary repos, like see what breaks from a clean local repository if these don't exist. Put all external projects behind a build flag - Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in maven central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple, have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227376#comment-14227376 ] Lv, Qi commented on SPARK-4315: --- I'm interested in this issue, but I can't reproduce your problem. I constructed a very simple workload according to your description, like this: from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext(appName="test") sqlContext = SQLContext(sc) lines = sc.parallelize(range(100), 12) people = lines.map(lambda x: {"name": str(x % 1000), "age": x}) schemaPeople = sqlContext.inferSchema(people) schemaPeople.registerAsTable("people") grouped = schemaPeople.groupBy(lambda x: x.name) grouped.collect() I tested it over spark-1.1 (2f9b2bd) and spark-master (0fe54cff). It finished in 3-4 seconds on both Spark versions. After disabling _restore_object's cache (adding return _create_cls(dataType)(obj) as its first line), it becomes obviously slow (I waited minutes; no need to wait more). Could you please give me more detailed information? PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given the difference was so huge. Luckily it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in: spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object) It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (100s of entries). Disabling the cache by adding: return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster. Implying that the caching is not providing the desired acceleration... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227380#comment-14227380 ] Apache Spark commented on SPARK-4632: - User 'prabeesh' has created a pull request for this issue: https://github.com/apache/spark/pull/3495 Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227385#comment-14227385 ] Sean Owen commented on SPARK-4170: -- Thanks [~boyork], I will propose a PR that resolves this with a bit of documentation somewhere. Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
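A minimal sketch of the workaround described above, moving the body into an explicit main() so that captured values are initialized before closures are serialized, might look like the following (hypothetical rewrite of the snippet, not the documentation change from the PR):
{code}
// Hypothetical fix: with an explicit main(), str1 is initialized before the
// closures capture it, so rslt2 equals rslt1 as expected.
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    val rslt1 = rdd.filter(x => x != "A").count
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count

    println("DemoFixed: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
    sc.stop()
  }
}
{code}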
[jira] [Created] (SPARK-4636) Cluster By Distribute By output different with Hive
Cheng Hao created SPARK-4636: Summary: Cluster By Distribute By output different with Hive Key: SPARK-4636 URL: https://issues.apache.org/jira/browse/SPARK-4636 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao This is a very interesting bug. Semantically, Cluster By / Distribute By will not cause a global ordering, as described in Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy However, the partition keys are sorted in MapReduce after shuffle, so from the user point of view, the partition key itself is globally ordered, and it may look like: http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
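To make the semantics concrete, here is a rough Scala sketch of the kind of query being discussed; the table and column names are placeholders, an existing SparkContext sc is assumed, and this is not taken from the associated PR:
{code}
// Illustrative only: CLUSTER BY key distributes rows by key and sorts within
// each partition, but it does not guarantee a total ordering of the output,
// even though in MapReduce the keys often look globally sorted after shuffle.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
val clustered = hiveContext.sql("SELECT key, value FROM src CLUSTER BY key")
clustered.collect().foreach(println)
{code}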
[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227402#comment-14227402 ] Prabeesh K edited comment on SPARK-4631 at 11/27/14 8:49 AM: - MQTT is known as protocol of IoT(Internet of Things). It is widely used in IoT area. was (Author: prabeeshk): MQTT is know as protocol of IoT(Internet of Things). It is widely used in IoT area. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4636) Cluster By Distribute By output different with Hive
[ https://issues.apache.org/jira/browse/SPARK-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227401#comment-14227401 ] Apache Spark commented on SPARK-4636: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3496 Cluster By Distribute By output different with Hive - Key: SPARK-4636 URL: https://issues.apache.org/jira/browse/SPARK-4636 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao This is a very interesting bug. Semantically, Cluster By / Distribute By will not cause a global ordering, as described in Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy However, the partition keys are sorted in MapReduce after shuffle, so from the user point of view, the partition key itself is globally ordered, and it may look like: http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227402#comment-14227402 ] Prabeesh K commented on SPARK-4631: --- MQTT is known as protocol of IoT(Internet of Things). It is widely used in IoT area. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
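Since the ask is a test that actually transfers data, here is a rough, hypothetical sketch of such a check. It assumes an MQTT broker is reachable at tcp://localhost:1883 (for example an embedded broker started in test setup); the topic name, timings, and assertion are made up, and this is not an actual Spark test suite.
{code}
// Hypothetical end-to-end check: publish messages with the Paho client and
// verify that MQTTUtils.createStream receives them.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persistence.MemoryPersistence

val brokerUrl = "tcp://localhost:1883"   // assumed local broker
val topic = "spark-mqtt-test"            // made-up topic name

val ssc = new StreamingContext("local[2]", "MQTTTest", Seconds(1))
val received = new java.util.concurrent.ConcurrentLinkedQueue[String]()

val stream = MQTTUtils.createStream(ssc, brokerUrl, topic)
stream.foreachRDD(rdd => rdd.collect().foreach(received.add))
ssc.start()

// Publish a few messages through the Paho blocking client.
val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
client.connect()
val mqttTopic = client.getTopic(topic)
(1 to 10).foreach(i => mqttTopic.publish(new MqttMessage(("msg-" + i).getBytes("UTF-8"))))
client.disconnect()

// A real suite would poll with a timeout instead of sleeping.
Thread.sleep(5000)
assert(!received.isEmpty, "no MQTT messages were received by the stream")
ssc.stop(stopSparkContext = true)
{code}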
[jira] [Created] (SPARK-4637) spark-1.1.0 does not compile any more
Olaf Flebbe created SPARK-4637: -- Summary: spark-1.1.0 does not compile any more Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 0.9.1 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227427#comment-14227427 ] Adam Davison commented on SPARK-4315: - Sure, will try to say what I can. Unfortunately I don't think I can easily give you a sample of the data. If we can't figure it out I can try to produce a fake sample that still exhibits the problem. But first let me try to come up with a few possibly salient differences: 1. My data is very wide, about 80 columns. 2. The size of the resulting groups in the groupBy is very ragged, whereas yours here is very even. Probably exponentially distributed in my case or more extreme. I wonder if this is generating many different Row types somehow. 3. My Row objects are constructed via the parquet functions of SQLContext. In my debugging I noticed that the cache size was reaching hundreds of entries or more from printing the number of items in the dict. I'll also include part of the code I was using: conf = pyspark.SparkConf().setMaster("local[24]").setAppName("test") sc = pyspark.SparkContext(conf = conf) sqlc = pyspark.sql.SQLContext(sc) data = sqlc.parquetFile("/home/adam/parquet_test") def getnow(): return int(round(time.time() * 1000)) def applyfunc2(data): ... # some work which returns a list object print "CHECKPOINT 1: %i" % (getnow()) data.cache() junk = data.map(lambda x: 0).collect() # This part introduced to separate the timing of disk load and computation print "CHECKPOINT 2: %i" % (getnow()) grouped = data.groupBy(lambda x: x.unique_user_identifier) print "CHECKPOINT 3: %i" % (getnow()) calced = grouped.flatMap(applyfunc2) print "CHECKPOINT 4: %i" % (getnow()) counts = calced.collect() print "CHECKPOINT 5: %i" % (getnow()) PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given the difference was so huge. Luckily it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in: spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object) It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (100s of entries). Disabling the cache by adding: return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster. Implying that the caching is not providing the desired acceleration... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4637) spark-1.1.0 does not compile any more
[ https://issues.apache.org/jira/browse/SPARK-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4637. -- Resolution: Duplicate spark-1.1.0 does not compile any more - Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1, 1.1.0 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227432#comment-14227432 ] Olaf Flebbe commented on SPARK-4632: The patch uses a SNAPSHOT dependency which is a no-go for Release Builds Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227431#comment-14227431 ] Apache Spark commented on SPARK-4170: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3497 Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227433#comment-14227433 ] Jinesh commented on SPARK-4631: --- MQTT is one of the most popular queuing protocols in IoT. I am from Amrita University. In our IoT project, we are using it as a connector between the sensor platform and the processing environment. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227432#comment-14227432 ] Olaf Flebbe edited comment on SPARK-4632 at 11/27/14 9:34 AM: -- At least one of the patches linked uses SNAPSHOT dependencies, which is a no-go for Release Builds was (Author: oflebbe): The patch uses a SNAPSHOT dependency which is a no-go for Release Builds Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4637) spark-1.1.0 does not compile any more
[ https://issues.apache.org/jira/browse/SPARK-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227441#comment-14227441 ] Benjamin Cabé commented on SPARK-4637: -- bq. I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. FWIW I downloaded 1.1.0 yesterday, and it built just fine, apparently getting mqtt 0.4.0 from spring.io repo. See http://repo.spring.io/webapp/search/artifact/?2q=mqtt and http://jcenter.bintray.com/org/eclipse/paho/mqtt-client/0.4.0/ spark-1.1.0 does not compile any more - Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1, 1.1.0 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227472#comment-14227472 ] Prabeesh K commented on SPARK-4631: --- They fixed the issue; now we have data in the [repo|https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/]. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227473#comment-14227473 ] Prabeesh K commented on SPARK-4632: --- They fixed the issue; now we have data in the [repo|https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/]. Now the older version works perfectly. Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227506#comment-14227506 ] Prabeesh K commented on SPARK-4631: --- [~tdas] refer to [this link|http://dev.eclipse.org/mhonarc/lists/paho-dev/msg02291.html] for the discussion going on in the paho-dev forum. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
madankumar s created SPARK-4638: --- Summary: Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s SPARK MLlib Classification Module Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] madankumar s updated SPARK-4638: Description: SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns was: SPARK MLlib Classification Module Add Kernel functionalities to SVM Classifier to find non linear patterns Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
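As background for the request above, a minimal sketch of the Gaussian/RBF kernel itself is shown below. This is not an MLlib API (MLlib's current SVM implementation is linear); gamma is a hypothetical hyperparameter chosen by the user.
{code}
// Illustrative only: the RBF (Gaussian) kernel K(x, y) = exp(-gamma * ||x - y||^2).
import org.apache.spark.mllib.linalg.Vector

def rbfKernel(x: Vector, y: Vector, gamma: Double): Double = {
  val squaredDist = x.toArray.zip(y.toArray)
    .map { case (a, b) => (a - b) * (a - b) }
    .sum
  math.exp(-gamma * squaredDist)
}
{code}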
[jira] [Created] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
Jacky Li created SPARK-4639: --- Summary: Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
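A rough sketch of the shape of this change (a hypothetical, self-contained illustration rather than the actual Catalyst source or the PR): make the fixed-point iteration limit a constructor parameter instead of the literal 100.
{code}
// Hypothetical sketch of the idea, not the Catalyst source.
case class FixedPoint(maxIterations: Int)

class Analyzer(maxIterations: Int = 100) {
  // Previously hard-coded as FixedPoint(100) with a TODO to parameterize it.
  val fixedPoint = FixedPoint(maxIterations)
}

// Callers can now tune the limit:
val analyzer = new Analyzer(maxIterations = 50)
{code}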
[jira] [Commented] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227744#comment-14227744 ] Apache Spark commented on SPARK-4639: - User 'jackylk' has created a pull request for this issue: https://github.com/apache/spark/pull/3499 Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4640) FixedRangePartitioner for partitioning items with a known range
Kevin Mader created SPARK-4640: -- Summary: FixedRangePartitioner for partitioning items with a known range Key: SPARK-4640 URL: https://issues.apache.org/jira/browse/SPARK-4640 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kevin Mader For the large datasets I work with, it is common to have light-weight keys and very heavy values (integers and large double arrays for example). The key values are however known and unchanging. It would be nice if Spark had a built in partitioner which could take advantage of this. A FixedRangePartitioner[T](keys: Seq[T], partitions: Int) would be ideal. Furthermore this partitioner type could be extended to a PartitionerWithKnownKeys that had a getAllKeys function allowing for a list of keys to be obtained without querying through the entire RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
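A minimal sketch of what such a partitioner could look like (a hypothetical class, not an existing Spark API): it precomputes a key-to-partition map from the known key sequence, so getPartition becomes an O(1) lookup and the full key list stays available without scanning the RDD.
{code}
// Hypothetical FixedRangePartitioner: keys are known up front and mapped to
// roughly equal-sized contiguous ranges of partitions. Illustrative only.
import org.apache.spark.Partitioner

class FixedRangePartitioner[T](keys: Seq[T], partitions: Int) extends Partitioner {
  require(partitions > 0, "Number of partitions must be positive")
  require(keys.nonEmpty, "Key sequence must not be empty")

  private val keyToPartition: Map[Any, Int] =
    keys.zipWithIndex.map { case (k, i) =>
      (k: Any) -> (i * partitions / keys.size)
    }.toMap

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int =
    keyToPartition.getOrElse(key, 0)  // unknown keys fall back to partition 0

  // The PartitionerWithKnownKeys idea mentioned above: expose the key list
  // without querying the RDD.
  def getAllKeys: Seq[T] = keys
}
{code}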
[jira] [Commented] (SPARK-4640) FixedRangePartitioner for partitioning items with a known range
[ https://issues.apache.org/jira/browse/SPARK-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227754#comment-14227754 ] Kevin Mader commented on SPARK-4640: I have code for both, that I could merge in, if there is interest. FixedRangePartitioner for partitioning items with a known range --- Key: SPARK-4640 URL: https://issues.apache.org/jira/browse/SPARK-4640 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kevin Mader For the large datasets I work with, it is common to have light-weight keys and very heavy values (integers and large double arrays for example). The key values are however known and unchanging. It would be nice if Spark had a built in partitioner which could take advantage of this. A FixedRangePartitioner[T](keys: Seq[T], partitions: Int) would be ideal. Furthermore this partitioner type could be extended to a PartitionerWithKnownKeys that had a getAllKeys function allowing for a list of keys to be obtained without querying through the entire RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4170. --- Resolution: Fixed Assignee: Sean Owen Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227842#comment-14227842 ] Patrick Wendell commented on SPARK-4598: Having sorting with pagination seems very difficult to do correctly since we rely on javascript for sorting in the frontend. It would be helpful to understand the exact memory requirements of serving hundreds of thousands of tasks. Where is the memory from? Can we just optimize the use of memory? We need to store all of those tasks anyway in the driver. Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical In the HistoryServer stage page, clicking the task href in Description triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4634) Enable metrics for each application to be gathered in one node
[ https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227848#comment-14227848 ] Masayoshi TSUZUKI commented on SPARK-4634: -- Sorry, GraphiteSink has already got the option prefix and it works fine. Enable metrics for each application to be gathered in one node -- Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names come to the top level, but we should be able to gather the application names under some top level node. For example, think of using graphite. When we use graphite, the application names are listed as top level nodes. Graphite can also collect OS metrics, and OS metrics can be put under one node. But the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add some top level name before the application name (the top level name may be the cluster name, for instance). If we make the name configurable by *.conf, it might also be convenient in case 2 different Spark clusters sink metrics to the same graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4634) Enable metrics for each application to be gathered in one node
[ https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI closed SPARK-4634. Resolution: Not a Problem GraphiteSink has already got the option prefix and it works fine. Enable metrics for each application to be gathered in one node -- Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names come to the top level, but we should be able to gather the application names under some top level node. For example, think of using graphite. When we use graphite, the application names are listed as top level nodes. Graphite can also collect OS metrics, and OS metrics can be put under one node. But the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add some top level name before the application name (the top level name may be the cluster name, for instance). If we make the name configurable by *.conf, it might also be convenient in case 2 different Spark clusters sink metrics to the same graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4626: --- Description: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated was: 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl -
[jira] [Resolved] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4626. Resolution: Fixed Fix Version/s: 1.2.0 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
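For context on this class of failure: the exception comes from calling apply() on a mutable HashMap for an executor id that has already been removed after the executor was lost. A generic, hypothetical sketch of the defensive lookup pattern that avoids crashing the actor is below; this is not the actual SPARK-4626 patch, and the map and method names are made up.
{code}
// Hypothetical illustration: prefer get/match over apply() when the key may
// have been removed concurrently (for example by an executor-lost event).
import scala.collection.mutable

val freedCores = mutable.HashMap[String, Int]()

def onStatusUpdate(executorId: String, cores: Int): Unit = {
  freedCores.get(executorId) match {
    case Some(existing) =>
      freedCores(executorId) = existing + cores
    case None =>
      // Executor already removed; log and ignore instead of throwing.
      println(s"Ignoring status update for unknown executor $executorId")
  }
}
{code}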
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4626: --- Description: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} was: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR
[jira] [Created] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager
hzw created SPARK-4641: -- Summary: A FileNotFoundException happened in Hash Shuffle Manager Key: SPARK-4641 URL: https://issues.apache.org/jira/browse/SPARK-4641 Project: Spark Issue Type: Bug Components: Input/Output, Shuffle Environment: A WordCount Example with some special text input (normal words text) Reporter: hzw Using Hash Shuffle without consolidateFiles, it throws an exception like this: java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0) Caused by: java.io.FileNotFoundException: (No such file or directory) And using Hash Shuffle with consolidateFiles, it throws another exception: java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4642) Documents about running-on-YARN needs update
Masayoshi TSUZUKI created SPARK-4642: Summary: Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Minor Documents about running-on-YARN needs update There are some parameters missing from the running-on-YARN documentation page. We need to add descriptions for the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description of the default for this parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4642) Documents about running-on-YARN needs update
[ https://issues.apache.org/jira/browse/SPARK-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228017#comment-14228017 ] Apache Spark commented on SPARK-4642: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3500 Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Minor Documents about running-on-YARN needs update There are some parameters missing from the running-on-YARN documentation page. We need to add descriptions for the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description of the default for this parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4613. -- Resolution: Fixed Fix Version/s: 1.2.0 Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
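For reference, the existing Scala constructor is function-heavy, which is what makes it awkward to call from Java. A minimal usage sketch is below; the JDBC URL, query, and bounds are placeholders, and an existing SparkContext sc is assumed.
{code}
// Sketch of current Scala-side JdbcRDD usage (placeholders, not real data).
// The Java wrapper proposed in this ticket exposes the same constructor
// through the Java function interfaces.
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,                                                        // existing SparkContext
  () => DriverManager.getConnection("jdbc:h2:mem:testdb"),   // connection factory
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",   // query with two ? bounds
  1, 100, 3,                                                 // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt(1), rs.getString(2)))        // row mapper
{code}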
[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4613: - Issue Type: Improvement (was: Bug) Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228043#comment-14228043 ] meiyoula commented on SPARK-4598: - Yeah, optimizing memory use might resolve the problem once, but it is not an effective solution. Sorting happens before pagination, so that is not a problem. Using pagination in the HistoryServer Spark UI can lower the memory requirements, so why not do this? It would improve Spark cluster capabilities and benefit Spark users. Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical On the HistoryServer stage page, clicking the task link in Description triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
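A minimal sketch of the kind of server-side pagination being discussed, with hypothetical names; the idea is that the stage page would build HTML only for the requested slice of task rows rather than all of them at once.
{code}
// Hypothetical helper, not the actual Spark UI code: select one page of task rows.
case class TaskRow(taskId: Long, status: String, durationMs: Long)

def pageOf(tasks: Seq[TaskRow], page: Int, pageSize: Int = 100): Seq[TaskRow] = {
  require(page >= 1 && pageSize > 0, "page is 1-based and pageSize must be positive")
  // Only this slice is rendered, so memory stays bounded even with 100,000 tasks.
  tasks.slice((page - 1) * pageSize, page * pageSize)
}

// e.g. pageOf(allTasks, page = 3) keeps rows 201-300 only.
{code}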
[jira] [Updated] (SPARK-4533) SchemaRDD Api error: Can only subtract another SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-4533: - Summary: SchemaRDD Api error: Can only subtract another SchemaRDD (was: Can only subtract another SchemaRDD) SchemaRDD Api error: Can only subtract another SchemaRDD Key: SPARK-4533 URL: https://issues.apache.org/jira/browse/SPARK-4533 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: JDK6/7 Reporter: Shawn Guo Priority: Minor There are two unexpected validations in the SchemaRDD APIs below. subtract(self, other, numPartitions=None) Can only subtract another SchemaRDD intersection(self, other) Can only intersect with another SchemaRDD "Can only subtract another SchemaRDD" is thrown when a SchemaRDD subtracts another type of RDD. Reproduce Steps: A = SchemaRDD B = SchemaRDD A_APX = A.keyBy(lambda line: None) B_APX = B.keyBy(lambda line: None) {color:red} CROSSED = A_APX.join(B_APX).map(lambda line: line[1]).filter(<filter condition>).map(lambda line: line[0]) {color} C = A.subtract(CROSSED) {color:red}#ERROR: Can only subtract another SchemaRDD{color} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4533) SchemaRDD API error: Can only subtract another SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-4533: - Summary: SchemaRDD API error: Can only subtract another SchemaRDD (was: SchemaRDD Api error: Can only subtract another SchemaRDD) SchemaRDD API error: Can only subtract another SchemaRDD Key: SPARK-4533 URL: https://issues.apache.org/jira/browse/SPARK-4533 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: JDK6/7 Reporter: Shawn Guo Priority: Minor There are two unexpected validations in the SchemaRDD APIs below. subtract(self, other, numPartitions=None) Can only subtract another SchemaRDD intersection(self, other) Can only intersect with another SchemaRDD "Can only subtract another SchemaRDD" is thrown when a SchemaRDD subtracts another type of RDD. Reproduce Steps: A = SchemaRDD B = SchemaRDD A_APX = A.keyBy(lambda line: None) B_APX = B.keyBy(lambda line: None) {color:red} CROSSED = A_APX.join(B_APX).map(lambda line: line[1]).filter(<filter condition>).map(lambda line: line[0]) {color} C = A.subtract(CROSSED) {color:red}#ERROR: Can only subtract another SchemaRDD{color} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228105#comment-14228105 ] Apache Spark commented on SPARK-4626: - User 'roxchkplusony' has created a pull request for this issue: https://github.com/apache/spark/pull/3503 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228103#comment-14228103 ] Apache Spark commented on SPARK-4626: - User 'roxchkplusony' has created a pull request for this issue: https://github.com/apache/spark/pull/3502 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
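The trace shows a mutable HashMap in the scheduler backend being indexed with apply for an executor that has already been removed. A hedged sketch of the usual defensive pattern for this class of bug follows; the names are hypothetical and this is not necessarily what the PRs above do.
{code}
import scala.collection.mutable

// Hypothetical stand-in for the executor bookkeeping map in CoarseGrainedSchedulerBackend.
case class ExecutorData(host: String, freeCores: Int)
val executorDataMap = mutable.HashMap[String, ExecutorData]()

def handleStatusUpdate(executorId: String, coresFreed: Int): Unit =
  executorDataMap.get(executorId) match {
    case Some(data) =>
      executorDataMap(executorId) = data.copy(freeCores = data.freeCores + coresFreed)
    case None =>
      // The executor was already removed (e.g. its Akka client disassociated); ignore the
      // stale message instead of letting HashMap.apply throw NoSuchElementException.
      println(s"Ignoring status update from unknown executor $executorId")
  }
{code}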
[jira] [Commented] (SPARK-4643) spark staging repository location outdated
[ https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228108#comment-14228108 ] Apache Spark commented on SPARK-4643: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3504 spark staging repository location outdated -- Key: SPARK-4643 URL: https://issues.apache.org/jira/browse/SPARK-4643 Project: Spark Issue Type: Improvement Components: Build Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4644) Implement skewed join
Shixiong Zhu created SPARK-4644: --- Summary: Implement skewed join Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
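A common way to implement a skewed join is to salt the hot keys: scatter the larger side's hot-key rows across N sub-keys and replicate the other side's hot-key rows N times so every sub-key still finds its match. The sketch below illustrates that technique with hypothetical names; the design document attached to this issue may take a different approach.
{code}
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical sketch of a salted join for one known hot key; not the proposed implementation.
def saltedJoin[V: ClassTag, W: ClassTag](
    left: RDD[(String, V)],
    right: RDD[(String, W)],
    hotKey: String,
    salts: Int): RDD[(String, (V, W))] = {
  // Large side: scatter hot-key rows across `salts` sub-keys so no single task holds them all.
  val saltedLeft = left.map { case (k, v) =>
    val salt = if (k == hotKey) Random.nextInt(salts) else 0
    ((k, salt), v)
  }
  // Small side: replicate hot-key rows once per salt so every sub-key can still be matched.
  val saltedRight = right.flatMap { case (k, w) =>
    val copies = if (k == hotKey) salts else 1
    (0 until copies).map(salt => ((k, salt), w))
  }
  saltedLeft.join(saltedRight).map { case ((k, _), pair) => (k, pair) }
}
{code}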
[jira] [Updated] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4644: Attachment: Skewed Join Design Doc.pdf The design doc for the skewed join Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228113#comment-14228113 ] Apache Spark commented on SPARK-4644: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3505 Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager
[ https://issues.apache.org/jira/browse/SPARK-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4641. -- Resolution: Duplicate A FileNotFoundException happened in Hash Shuffle Manager Key: SPARK-4641 URL: https://issues.apache.org/jira/browse/SPARK-4641 Project: Spark Issue Type: Bug Components: Input/Output, Shuffle Environment: A WordCount example with some special text input (normal words text) Reporter: hzw Using Hash Shuffle without consolidateFiles, it throws an exception like: java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0) Caused by: java.io.FileNotFoundException: (No such file or directory) And using Hash Shuffle with consolidateFiles, it throws a different exception: java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
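For reference, the two configurations the report contrasts can be toggled with these settings; the values simply mirror the report and are not recommendations (hash was still the default shuffle manager in Spark 1.1).
{code}
import org.apache.spark.SparkConf

// Settings matching the two scenarios described in the report (illustrative only).
val hashWithoutConsolidation = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "false") // scenario reporting FileNotFoundException

val hashWithConsolidation = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "true")  // scenario reporting snappy PARSING_ERROR(2)
{code}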
[jira] [Created] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
Cheng Lian created SPARK-4645: - Summary: Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver Key: SPARK-4645 URL: https://issues.apache.org/jira/browse/SPARK-4645 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well for normal JDBC clients like BeeLine, but throws exception when using Simba ODBC driver. Simba ODBC driver tries to execute two statement while connecting to Spark SQL HiveThriftServer2: - {{use `default`}} - {{set -v}} However, HiveThriftServer2 throws exception when executing them: {code} 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
Java heap space at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
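A hedged sketch of reproducing the same two statements over JDBC against a locally running HiveThriftServer2; the host, port, empty credentials, and the use of the Hive JDBC driver (instead of the Simba ODBC driver from the report) are assumptions for illustration.
{code}
import java.sql.DriverManager

// Assumes HiveThriftServer2 is listening on localhost:10000 and hive-jdbc is on the classpath.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
try {
  // The two statements the Simba ODBC driver issues while connecting.
  stmt.execute("use `default`")
  stmt.execute("set -v")
} finally {
  stmt.close()
  conn.close()
}
{code}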
[jira] [Created] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
Takeshi Yamamuro created SPARK-4646: --- Summary: Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark Key: SPARK-4646 URL: https://issues.apache.org/jira/browse/SPARK-4646 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor This patch simply replaces the native quick sort with Sorter (TimSort) in Spark. It yielded performance gains of ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
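For orientation, scala.util.Sorting.quickSort sorts an array in place with a quicksort, while the JDK's Arrays.sort for object arrays uses a TimSort; Spark's internal Sorter wraps a similar TimSort implementation. The snippet below is only a generic illustration of the two entry points, not the patch itself.
{code}
import java.util.Arrays
import scala.util.{Random, Sorting}

// Generic illustration: Scala's quicksort vs. the JDK's TimSort-based sort for object arrays.
val primitives: Array[Int] = Array.fill(1000000)(Random.nextInt())
val boxed: Array[java.lang.Integer] = primitives.map(Int.box)

Sorting.quickSort(primitives) // scala.util.Sorting: quicksort, in place
Arrays.sort(boxed)            // java.util.Arrays: TimSort for object arrays, in place
{code}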
[jira] [Commented] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
[ https://issues.apache.org/jira/browse/SPARK-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228129#comment-14228129 ] Apache Spark commented on SPARK-4646: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3507 Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark -- Key: SPARK-4646 URL: https://issues.apache.org/jira/browse/SPARK-4646 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor This patch simply replaces the native quick sort with Sorter (TimSort) in Spark. It yielded performance gains of ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228130#comment-14228130 ] Apache Spark commented on SPARK-4645: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3506 Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver - Key: SPARK-4645 URL: https://issues.apache.org/jira/browse/SPARK-4645 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well for normal JDBC clients like BeeLine, but throws exception when using Simba ODBC driver. Simba ODBC driver tries to execute two statement while connecting to Spark SQL HiveThriftServer2: - {{use `default`}} - {{set -v}} However, HiveThriftServer2 throws exception when executing them: {code} 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
Java heap space at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at