[jira] [Created] (SPARK-5476) SQLContext.createDataFrame shouldn't be an implicit function
Reynold Xin created SPARK-5476: -- Summary: SQLContext.createDataFrame shouldn't be an implicit function Key: SPARK-5476 URL: https://issues.apache.org/jira/browse/SPARK-5476 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin It is sort of strange to ask users to import sqlContext._ or sqlContext.createDataFrame. The proposal here is to ask users to define an implicit val for SQLContext, and then dsl package object should include an implicit function that converts an RDD[Product] to a DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
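A minimal Scala sketch of the proposal above, assuming a hypothetical dsl package object and conversion name (rddToDataFrame is illustrative, not the actual API):

```scala
import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// Sketch only: the user declares the SQLContext once as an implicit value,
//   implicit val sqlContext: SQLContext = new SQLContext(sc)
// and a dsl package object supplies the conversion from RDD[Product] to
// DataFrame, picking the SQLContext up from implicit scope.
object dsl {
  implicit def rddToDataFrame[A <: Product : TypeTag](rdd: RDD[A])(
      implicit sqlContext: SQLContext): DataFrame =
    sqlContext.createDataFrame(rdd)
}
```

With such a conversion in scope, an RDD of case classes could be used where a DataFrame is expected without importing anything from a specific sqlContext instance.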
[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3977: - Assignee: Burak Yavuz > Conversions between {Row, Coordinate}Matrix <-> BlockMatrix > --- > > Key: SPARK-3977 > URL: https://issues.apache.org/jira/browse/SPARK-3977 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh >Assignee: Burak Yavuz > Fix For: 1.3.0 > > > Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3977. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4256 [https://github.com/apache/spark/pull/4256] > Conversions between {Row, Coordinate}Matrix <-> BlockMatrix > --- > > Key: SPARK-3977 > URL: https://issues.apache.org/jira/browse/SPARK-3977 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > Fix For: 1.3.0 > > > Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
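A short usage sketch of the conversions this ticket adds; the method names (toBlockMatrix, toCoordinateMatrix) are assumptions based on the 1.3.0 MLlib distributed-matrix API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build a small CoordinateMatrix, convert it to a BlockMatrix, and back.
def blockMatrixRoundTrip(sc: SparkContext): Unit = {
  val entries = sc.parallelize(Seq(
    MatrixEntry(0L, 0L, 1.0),
    MatrixEntry(1L, 2L, 3.0),
    MatrixEntry(2L, 1L, 5.0)))
  val coordMat = new CoordinateMatrix(entries)

  val blockMat = coordMat.toBlockMatrix(rowsPerBlock = 2, colsPerBlock = 2)
  blockMat.validate() // checks that block dimensions are consistent

  val roundTrip = blockMat.toCoordinateMatrix()
  println(s"${roundTrip.numRows()} x ${roundTrip.numCols()}")
}
```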
[jira] [Resolved] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar
[ https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2476. Resolution: Not a Problem [~srowen] Nope, I think we found a workaround. > Have sbt-assembly include runtime dependencies in jar > - > > Key: SPARK-2476 > URL: https://issues.apache.org/jira/browse/SPARK-2476 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Patrick Wendell >Assignee: Prashant Sharma >Priority: Minor > > If possible, we should try to contribute the ability to include > runtime-scoped dependencies in the assembly jar created with sbt-assembly. > Currently it only reads compile-scoped dependencies: > https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
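For reference, one possible way to pull runtime-scoped dependencies into the assembly with plain sbt settings (a sketch under assumptions; it is not necessarily the workaround mentioned above) is to point sbt-assembly at the Runtime classpath:

```scala
// build.sbt sketch (assumes the sbt-assembly plugin is already enabled).
// The Runtime classpath includes both compile- and runtime-scoped
// dependencies, so scoping the assembly task's fullClasspath to it bundles
// runtime-scoped jars as well.
fullClasspath in assembly := (fullClasspath in Runtime).value
```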
[jira] [Resolved] (SPARK-2487) Follow up from SBT build refactor (i.e. SPARK-1776)
[ https://issues.apache.org/jira/browse/SPARK-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2487. Resolution: Fixed > Follow up from SBT build refactor (i.e. SPARK-1776) > --- > > Key: SPARK-2487 > URL: https://issues.apache.org/jira/browse/SPARK-2487 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Patrick Wendell > > This is to track follow-up issues relating to SPARK-1776, which was a major > re-factoring of the SBT build in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5466: --- Component/s: Build > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.0 >Reporter: Jian Zhou >Priority: Blocker > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build error in multiple components, including Graph/MLLib/SQL, > when package com.google.common on the classpath incompatible with the version > used when compiling Utils.class > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296483#comment-14296483 ] Patrick Wendell commented on SPARK-5466: Also - [~srowen] can you reproduce this if you do not use Zinc? > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.0 >Reporter: Jian Zhou >Priority: Blocker > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build error in multiple components, including Graph/MLLib/SQL, > when package com.google.common on the classpath incompatible with the version > used when compiling Utils.class > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5466: --- Priority: Blocker (was: Major) > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0 >Reporter: Jian Zhou >Priority: Blocker > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build error in multiple components, including Graph/MLLib/SQL, > when package com.google.common on the classpath incompatible with the version > used when compiling Utils.class > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296482#comment-14296482 ] Patrick Wendell commented on SPARK-5466: I sent [~vanzin] and e-mail today about this. Guess I'm not the only one seeing it. I was using zinc on OSX... are you guys using that too? I set up a zinc maven build on Jenkins and it worked just fine. > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0 >Reporter: Jian Zhou > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build error in multiple components, including Graph/MLLib/SQL, > when package com.google.common on the classpath incompatible with the version > used when compiling Utils.class > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296479#comment-14296479 ] Patrick Wendell edited comment on SPARK-4049 at 1/29/15 6:58 AM: - [~skrasser] Yes - I agree that behavior is just confusing. One idea would be to have a "bit map" so to speak where you can't be 100% unless you have every partition cached. And you can never go over 100%. was (Author: pwendell): [~skrasser] Yes - I agree that behavior is just confusing. One idea would be to have a "bit map" so to speak where you can be 100% unless you have every partition cached. And you can never go over 100%. > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296479#comment-14296479 ] Patrick Wendell commented on SPARK-4049: [~skrasser] Yes - I agree that behavior is just confusing. One idea would be to have a "bit map" so to speak where you can be 100% unless you have every partition cached. And you can never go over 100%. > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
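An illustrative sketch of the "bit map" idea (not the actual Web UI code): track cached partitions by index, so the fraction is capped at 100% and only reaches 100% when every partition is present.

```scala
import scala.collection.mutable

// Hypothetical helper: one bit per partition of an RDD.
class CachedPartitionTracker(numPartitions: Int) {
  private val cached = new mutable.BitSet(numPartitions)

  def partitionCached(index: Int): Unit = cached += index
  def partitionEvicted(index: Int): Unit = cached -= index

  // Duplicate block reports for the same partition cannot push this past 1.0.
  def fractionCached: Double = cached.size.toDouble / numPartitions
}
```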
[jira] [Resolved] (SPARK-5471) java.lang.NumberFormatException: For input string:
[ https://issues.apache.org/jira/browse/SPARK-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5471. Resolution: Not a Problem Resolving per your own comment. > java.lang.NumberFormatException: For input string: > > > Key: SPARK-5471 > URL: https://issues.apache.org/jira/browse/SPARK-5471 > Project: Spark > Issue Type: New Feature >Affects Versions: 1.2.0 > Environment: Spark 1.2.0 Maven >Reporter: DeepakVohra > > Naive Bayes Classifier generates exception with sample_naive_bayes_data.txt > java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) > at > org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > localhost): java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) > at > org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 
15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > 15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0 > 15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at > MLUtils.scala:96, took 1.180869 s > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): > java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296430#comment-14296430 ] Vladimir Grigor commented on SPARK-5162: [~lianhuiwang] Thank you for the workaround suggestion! Still, I believe it would be great to have support for remote script files - that would greatly improve the usability of the YARN component in Spark. If you think that is the case, and you know the technical details of the system better, could you please create a ticket for that feature? Or please comment with any ideas on the technical implementation. Thank you! > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5475) Java 8 tests are like maintenance overhead.
[ https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296405#comment-14296405 ] Apache Spark commented on SPARK-5475: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/4264 > Java 8 tests are like maintenance overhead. > > > Key: SPARK-5475 > URL: https://issues.apache.org/jira/browse/SPARK-5475 > Project: Spark > Issue Type: Bug >Reporter: Prashant Sharma > > Having tests that validate the same code compatible with java 8 and java 7 is > like asserting that java 8 is backward compatible with java 7 and still > supports java 8 features(lambda expressions to be precise). This was once > necessary as asm was not compatible with java 8 and so on. > Running java8-tests on the current code base results in more than 100 > compilation errors, it felt as if they are never run. This is based on the > fact that compilation errors have existed for a pretty long period. So IMHO, > we should really remove them, if we don't plan to maintain. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5475) Java 8 tests are like maintenance overhead.
[ https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-5475: --- Issue Type: Bug (was: Wish) > Java 8 tests are like maintenance overhead. > > > Key: SPARK-5475 > URL: https://issues.apache.org/jira/browse/SPARK-5475 > Project: Spark > Issue Type: Bug >Reporter: Prashant Sharma > > Having tests that validate the same code compatible with java 8 and java 7 is > like asserting that java 8 is backward compatible with java 7 and still > supports java 8 features(lambda expressions to be precise). This was once > necessary as asm was not compatible with java 8 and so on. > Running java8-tests on the current code base results in more than 100 > compilation errors, it felt as if they are never run. This is based on the > fact that compilation errors have existed for a pretty long period. So IMHO, > we should really remove them, if we don't plan to maintain. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5475) Java 8 tests are like maintenance overhead.
Prashant Sharma created SPARK-5475: -- Summary: Java 8 tests are like maintenance overhead. Key: SPARK-5475 URL: https://issues.apache.org/jira/browse/SPARK-5475 Project: Spark Issue Type: Wish Reporter: Prashant Sharma Having tests that validate the same code compatible with java 8 and java 7 is like asserting that java 8 is backward compatible with java 7 and still supports java 8 features(lambda expressions to be precise). This was once necessary as asm was not compatible with java 8 and so on. Running java8-tests on the current code base results in more than 100 compilation errors, it felt as if they are never run. This is based on the fact that compilation errors have existed for a pretty long period. So IMHO, we should really remove them, if we don't plan to maintain. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5474) curl should support URL redirection in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296387#comment-14296387 ] Apache Spark commented on SPARK-5474: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/4263 > curl should support URL redirection in build/mvn > > > Key: SPARK-5474 > URL: https://issues.apache.org/jira/browse/SPARK-5474 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.0 >Reporter: Guoqiang Li > > {{http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz}} > sometimes return 3xx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5474) curl should support URL redirection in build/mvn
Guoqiang Li created SPARK-5474: -- Summary: curl should support URL redirection in build/mvn Key: SPARK-5474 URL: https://issues.apache.org/jira/browse/SPARK-5474 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Guoqiang Li {{http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz}} sometimes return 3xx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296375#comment-14296375 ] Lianhui Wang commented on SPARK-5162: - The previous problem is that packages.egg is not found. but from your commonds, I donnot see packages.egg from your spark-submit and there are only package1.egg and package1.egg.so i think you need to add packages.egg to --py-files and you can try it. Or firstly you can run on yarn-client mode, if it is ok then try to run on yarn cluste mode. That can help us to find some problems. > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5162: Comment: was deleted (was: The previous problem is that packages.egg is not found. but from your commonds, I donnot see packages.egg from your spark-submit and there are only package1.egg and package1.egg.so i think you need to add packages.egg to --py-files and you can try it. Or firstly you can run on yarn-client mode, if it is ok then try to run on yarn cluste mode. That can help us to find some problems.) > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296376#comment-14296376 ] Lianhui Wang commented on SPARK-5162: - The previous problem is that packages.egg is not found. but from your commonds, I donnot see packages.egg from your spark-submit and there are only package1.egg and package1.egg.so i think you need to add packages.egg to --py-files and you can try it. Or firstly you can run on yarn-client mode, if it is ok then try to run on yarn cluste mode. That can help us to find some problems. > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5473) Expose SSH failures after status checks pass
[ https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296374#comment-14296374 ] Apache Spark commented on SPARK-5473: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/4262 > Expose SSH failures after status checks pass > > > Key: SPARK-5473 > URL: https://issues.apache.org/jira/browse/SPARK-5473 > Project: Spark > Issue Type: Improvement > Components: EC2 >Affects Versions: 1.2.0 >Reporter: Nicholas Chammas >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5473) Expose SSH failures after status checks pass
Nicholas Chammas created SPARK-5473: --- Summary: Expose SSH failures after status checks pass Key: SPARK-5473 URL: https://issues.apache.org/jira/browse/SPARK-5473 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296344#comment-14296344 ] Apache Spark commented on SPARK-5472: - User 'tmyklebu' has created a pull request for this issue: https://github.com/apache/spark/pull/4261 > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Priority: Minor > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction---save a DataFrame > to a database---for instance in an ETL job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5472) Add support for reading from and writing to a JDBC database
Tor Myklebust created SPARK-5472: Summary: Add support for reading from and writing to a JDBC database Key: SPARK-5472 URL: https://issues.apache.org/jira/browse/SPARK-5472 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tor Myklebust Priority: Minor It would be nice to be able to make a table in a JDBC database appear as a table in Spark SQL. This would let users, for instance, perform a JOIN between a DataFrame in Spark SQL with a table in a Postgres database. It might also be nice to be able to go the other direction---save a DataFrame to a database---for instance in an ETL job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
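Today this kind of access goes through the low-level JdbcRDD; a sketch of that (the connection URL, table, and column names below are made up) shows why a Spark SQL level API would be nicer:

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Read a range-partitioned slice of a table as an RDD of (id, name) tuples.
def loadUsers(sc: SparkContext) =
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb"),
    // The query must contain two '?' placeholders for the partition bounds.
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    1L,      // lower bound
    100000L, // upper bound
    10,      // number of partitions
    (rs: ResultSet) => (rs.getLong("id"), rs.getString("name")))
```

A DataFrame-level API would instead let such a table participate directly in Spark SQL queries and joins, which is what this ticket proposes.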
[jira] [Commented] (SPARK-5471) java.lang.NumberFormatException: For input string:
[ https://issues.apache.org/jira/browse/SPARK-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296307#comment-14296307 ] DeepakVohra commented on SPARK-5471: Not a bug. The sample data has to be split at the ,. > java.lang.NumberFormatException: For input string: > > > Key: SPARK-5471 > URL: https://issues.apache.org/jira/browse/SPARK-5471 > Project: Spark > Issue Type: New Feature >Affects Versions: 1.2.0 > Environment: Spark 1.2.0 Maven >Reporter: DeepakVohra > > Naive Bayes Classifier generates exception with sample_naive_bayes_data.txt > java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) > at > org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > localhost): java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) > at > org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) > at > org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:745) > 15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > 15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0 > 15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at > MLUtils.scala:96, took 1.180869 s > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): > java.lang.NumberFormatException: For input string: "0,1" > at > sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) > at java.lang.Double.parseDouble(Double.java:540) > at > scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) > at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) > at > org.apache.spa
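In other words, each line of sample_naive_bayes_data.txt appears to be "label,f1 f2 f3", so the label has to be split off at the comma before the values are parsed as doubles. A minimal parsing sketch (the file path is an assumption):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def loadNaiveBayesData(sc: SparkContext) =
  sc.textFile("data/mllib/sample_naive_bayes_data.txt").map { line =>
    // Split the label from the features at the comma first, then split
    // the space-separated feature values.
    val Array(label, features) = line.split(',')
    LabeledPoint(label.toDouble,
      Vectors.dense(features.trim.split(' ').map(_.toDouble)))
  }
```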
[jira] [Updated] (SPARK-5445) Make sure DataFrame expressions are usable in Java
[ https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5445: --- Description: Some DataFrame expressions are not exactly usable in Java. For example, aggregate functions are only defined in the dsl package object, which is painful to use. (was: Some DataFrame expressions are not exactly usable in Java. For example, aggregate functions are only defined in the dsl package object, which is painful to use. Another example is operator overloading, which would require Java users to use $plus. ) > Make sure DataFrame expressions are usable in Java > -- > > Key: SPARK-5445 > URL: https://issues.apache.org/jira/browse/SPARK-5445 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Some DataFrame expressions are not exactly usable in Java. For example, > aggregate functions are only defined in the dsl package object, which is > painful to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML pacakge
[ https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296309#comment-14296309 ] Apache Spark commented on SPARK-5253: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/4259 > LinearRegression with L1/L2 (elastic net) using OWLQN in new ML pacakge > --- > > Key: SPARK-5253 > URL: https://issues.apache.org/jira/browse/SPARK-5253 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5445) Make sure DataFrame expressions are usable in Java
[ https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5445. Resolution: Fixed Fix Version/s: 1.3.0 > Make sure DataFrame expressions are usable in Java > -- > > Key: SPARK-5445 > URL: https://issues.apache.org/jira/browse/SPARK-5445 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Some DataFrame expressions are not exactly usable in Java. For example, > aggregate functions are only defined in the dsl package object, which is > painful to use. Another example is operator overloading, which would require > Java users to use $plus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
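The operator-overloading point is a JVM name-mangling issue: a Scala symbolic method such as + compiles to a method named $plus, which is the only name Java callers can use. A tiny illustration (not Spark's actual Column class):

```scala
// Illustrative only: a DSL class that overloads '+'.
class Column(val expr: String) {
  def +(other: Column): Column = new Column(s"(${this.expr} + ${other.expr})")
}

// From Scala:  colA + colB
// From Java:   colA.$plus(colB)   // the mangled name is all Java can see
```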
[jira] [Created] (SPARK-5471) java.lang.NumberFormatException: For input string:
DeepakVohra created SPARK-5471: -- Summary: java.lang.NumberFormatException: For input string: Key: SPARK-5471 URL: https://issues.apache.org/jira/browse/SPARK-5471 Project: Spark Issue Type: New Feature Affects Versions: 1.2.0 Environment: Spark 1.2.0 Maven Reporter: DeepakVohra Naive Bayes Classifier generates exception with sample_naive_bayes_data.txt java.lang.NumberFormatException: For input string: "0,1" at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) at java.lang.Double.parseDouble(Double.java:540) at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "0,1" at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) at java.lang.Double.parseDouble(Double.java:540) at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0 15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at 
MLUtils.scala:96, took 1.180869 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "0,1" at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250) at java.lang.Double.parseDouble(Double.java:540) at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79) at org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249) at org.apache.spark.CacheManager.putInBl
[jira] [Commented] (SPARK-5470) use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296282#comment-14296282 ] Apache Spark commented on SPARK-5470: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4258 > use defaultClassLoader of Serializer to load classes of classesToRegister in > KryoSerializer > --- > > Key: SPARK-5470 > URL: https://issues.apache.org/jira/browse/SPARK-5470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Lianhui Wang > > Now KryoSerializer load classes of classesToRegister at the time of its > initialization. when we set spark.kryo.classesToRegister=class1, it will > throw SparkException("Failed to load class to register with Kryo". > because in KryoSerializer's initialization, classLoader cannot include class > of user's jars. > we need to use defaultClassLoader of Serializer in newKryo(), because > executor will reset defaultClassLoader of Serializer after Serializer's > initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296301#comment-14296301 ] Chris T commented on SPARK-5436: I may be able to attempt this, but I'm not confident I'll be able to implement a solution (or even one of sufficient quality). I only wrote my first line of scala about a week ago, so I'm still finding my way around how things work. If I can come up with something, I'll share it... > Validate GradientBoostedTrees during training > - > > Key: SPARK-5436 > URL: https://issues.apache.org/jira/browse/SPARK-5436 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > For Gradient Boosting, it would be valuable to compute test error on a > separate validation set during training. That way, training could stop early > based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
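The idea in the description can be sketched naively against the current MLlib API; this only illustrates early stopping by retraining from scratch at each ensemble size, it is not the eventual API, and a real implementation would evaluate incrementally during boosting:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

def trainWithEarlyStopping(
    training: RDD[LabeledPoint],
    validation: RDD[LabeledPoint],
    maxIterations: Int): Unit = {
  val strategy = BoostingStrategy.defaultParams("Regression")

  // Mean squared error on the held-out set for a model with numIter trees.
  def validationMSE(numIter: Int): Double = {
    strategy.numIterations = numIter
    val model = GradientBoostedTrees.train(training, strategy)
    validation.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
  }

  var best = validationMSE(1)
  var iter = 2
  var improving = true
  while (iter <= maxIterations && improving) {
    val mse = validationMSE(iter)
    if (mse < best) { best = mse; iter += 1 } else improving = false
  }
  println(s"Stopped at ${iter - 1} iterations, validation MSE = $best")
}
```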
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296303#comment-14296303 ] Dana Klassen commented on SPARK-5162: - Yes of course. I tried these combinations: HADOOP_CONF_DIR=conf/conf.cloudera.yarn ./bin/spark-submit --master yarn-cluster --num-executors 2 --executor-cores 1 /Users/klassen/Desktop/test.py HADOOP_CONF_DIR=conf/conf.cloudera.yarn ./bin/spark-submit --master yarn-cluster --py-files '/path/to/package1.egg' --num-executors 2 --executor-cores 1 /Users/klassen/Desktop/test.py HADOOP_CONF_DIR=conf/conf.cloudera.yarn ./bin/spark-submit --master yarn-cluster --py-files '/path/to/package1.egg,/path/to/package2.egg'--num-executors 2 --executor-cores 1 /Users/klassen/Desktop/test.py *The test script in this case makes no use of the resources in the eggs I forgot to include enough of the logs to show that packages are uploaded to hdfs sparkStaging as follows:: ``` 15/01/28 21:38:07 INFO Client: Source and destination file systems are the same. Not copying hdfs://nn01.chi.shopify.com:8020/user/sparkles/spark-assembly-python-submit.jar 15/01/28 21:38:07 INFO Client: Uploading resource file:/Users/klassen/Desktop/test.py -> hdfs://nn01.chi.shopify.com:8020/user/klassen/.sparkStaging/application_1422398120127_3034/test.py ``` This is seen for the packages as well. Before these packages are downloaded to the container and setup they are cleared from sparkStaging ( seen at the end of the previous logs). > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... 
> } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296270#comment-14296270 ] Lianhui Wang commented on SPARK-5162: - [~dklassen] Can you provide the syslog of the ApplicationMaster? From the information you provided, the .egg packages do not get uploaded to HDFS at the beginning of the application. Could you also share the exact spark-submit command with me? > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager setup an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get setup, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code(FSDownload.java line 249) related > to container localization of files upon downloading. At this point thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5470) use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer
Lianhui Wang created SPARK-5470: --- Summary: use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer Key: SPARK-5470 URL: https://issues.apache.org/jira/browse/SPARK-5470 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang Currently KryoSerializer loads the classes listed in classesToRegister at the time of its own initialization. When we set spark.kryo.classesToRegister=class1, it throws SparkException("Failed to load class to register with Kryo") because, at KryoSerializer initialization time, the classLoader does not yet include the classes from the user's jars. We need to use the Serializer's defaultClassLoader in newKryo() instead, because the executor resets the Serializer's defaultClassLoader only after the Serializer has been initialized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
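A minimal sketch of the idea, assuming the class names come from spark.kryo.classesToRegister as a Seq[String] and that the serializer exposes its defaultClassLoader as an Option[ClassLoader] (as Spark's Serializer base class does); the wrapper object and parameter names below are illustrative, not the actual KryoSerializer code:
{code}
import com.esotericsoftware.kryo.Kryo

// Resolve registered class names lazily inside newKryo(), against the class loader the
// executor installs on the serializer, instead of the loader that happened to be visible
// when the serializer object was first constructed.
object KryoRegistrationSketch {
  def newKryo(classesToRegister: Seq[String],
              defaultClassLoader: Option[ClassLoader]): Kryo = {
    val kryo = new Kryo()
    val loader = defaultClassLoader.getOrElse(Thread.currentThread.getContextClassLoader)
    kryo.setClassLoader(loader)
    classesToRegister.foreach { name =>
      // Class.forName against the executor's loader can see classes shipped in user jars
      kryo.register(Class.forName(name, true, loader))
    }
    kryo
  }
}
{code}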
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296232#comment-14296232 ] Ye Xianjin commented on SPARK-4631: --- Hi [~dragos], I have the same issue here. I'd like to copy the email I sent to Sean here, which may help. {quote} Hi Sean: I enabled the debug flag in log4j. I believe the MQTTStreamSuite failure is more likely due to some weird network issue. However, I cannot understand why this exception would be thrown. What I saw in unit-tests.log is below: 15/01/28 23:41:37.390 ActiveMQ Transport: tcp:///127.0.0.1:53845@23456 DEBUG Transport: Transport Connection to: tcp://127.0.0.1:53845 failed: java.net.ProtocolException: Invalid CONNECT encoding java.net.ProtocolException: Invalid CONNECT encoding at org.fusesource.mqtt.codec.CONNECT.decode(CONNECT.java:77) at org.apache.activemq.transport.mqtt.MQTTProtocolConverter.onMQTTCommand(MQTTProtocolConverter.java:118) at org.apache.activemq.transport.mqtt.MQTTTransportFilter.onCommand(MQTTTransportFilter.java:74) at org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83) at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:222) at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:204) at java.lang.Thread.run(Thread.java:695) However, when I looked at the code http://grepcode.com/file/repo1.maven.org/maven2/org.fusesource.mqtt-client/mqtt-client/1.3/org/fusesource/mqtt/codec/CONNECT.java#76 , I don’t quite understand why that would happen. I am not familiar with activemq, maybe you can look at this and figure out what really happened. {quote} From a quick look at the paho.mqtt-client code, a possible cause of the failure is that org.eclipse.paho.mqtt-client may not write PROTOCOL_NAME into the MQTT CONNECT frame. But that doesn't quite make sense either, since the same test passes on Jenkins, so I am not sure. > Add real unit test for MQTT > > > Key: SPARK-4631 > URL: https://issues.apache.org/jira/browse/SPARK-4631 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Priority: Critical > Fix For: 1.3.0 > > > A real unit test that actually transfers data to ensure that the MQTTUtil is > functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-5346. - Resolution: Not a Problem I verified that filter push-down actually is enabled even if we set {{parquet.task.side.metadata}} to {{true}}. The actual filtering happens when the {{ParquetRecordReader.initialize()}} is called in {{NewHadoopRDD.compute}}. See [here|https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L135] and [here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java#L157-L158]. As for Spark task input size, it seems that Hadoop {{FileSystem}} adds the size of a block to the metrics even if we only touch a fraction of it (reading Parquet metadata for example). This behaviour can be verified by the following snippet: {code} import org.apache.spark.sql.Row import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sc._ import sqlContext._ case class KeyValue(key: Int, value: String) parallelize(1 to 1024 * 1024 * 20). flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))). saveAsParquetFile("large.parquet") hadoopConfiguration.set("parquet.task.side.metadata", "true") sql("SET spark.sql.parquet.filterPushdown=true") parquetFile("large.parquet").where('key === 0).queryExecution.toRdd.mapPartitions { _ => new Iterator[Row] { def hasNext = false def next() = ??? } }.collect() {code} Apparently we're reading nothing here (except for Parquet metadata in the footers), but the web UI still suggests that the input size of all tasks equals to the file size. In addition, we may find log lines written by {{ParquetRecordReader}} like this: {code} ... 15/01/28 16:50:56 INFO FilterCompat: Filtering using predicate: eq(key, 0) 15/01/28 16:50:56 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 0 records. ... {code} which suggests row group filtering does work as expected. So I'll just close this ticket. > Parquet filter pushdown is not enabled when parquet.task.side.metadata is set > to true (default value) > - > > Key: SPARK-5346 > URL: https://issues.apache.org/jira/browse/SPARK-5346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Cheng Lian >Priority: Blocker > > When computing Parquet splits, reading Parquet metadata from executor side is > more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to > {{true}} by > default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. > However, somehow this disables filter pushdown. > To workaround this issue and enable Parquet filter pushdown, users can set > {{spark.sql.parquet.filterPushdown}} to {{true}} and > {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files > with a large number of part-files and/or columns, reading metadata from > driver side eats lots of memory. > The following Spark shell snippet can be useful to reproduce this issue: > {code} > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > import sqlContext._ > case class KeyValue(key: Int, value: String) > sc. > parallelize(1 to 1024). > flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))). 
> saveAsParquetFile("large.parquet") > parquetFile("large.parquet").registerTempTable("large") > sql("SET spark.sql.parquet.filterPushdown=true") > sql("SELECT * FROM large").collect() > sql("SELECT * FROM large WHERE key < 200").collect() > {code} > Users can verify this issue by checking the input size metrics from web UI. > When filter pushdown is enabled, the second query reads fewer data. > Notice that {{parquet.task.side.metadata}} must be set in _Hadoop_ > configuration (either via {{core-site.xml}} or > {{SparkConf.hadoopConfiguration.set()}}), setting it in > {{spark-defaults.conf}} or via {{SparkConf}} does NOT work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
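For reference, here is a minimal spark-shell sketch of the workaround described in the quoted issue; it assumes the SparkContext {{sc}} provided by the shell and simply illustrates that the flag must be set on the Hadoop Configuration object rather than in Spark's own configuration files:
{code}
// parquet.task.side.metadata is a Hadoop setting, so set it via sc.hadoopConfiguration,
// not in spark-defaults.conf; filter pushdown itself is a Spark SQL setting.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // `sc` is the SparkContext from spark-shell
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
{code}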
[jira] [Resolved] (SPARK-5430) Move treeReduce and treeAggregate to core
[ https://issues.apache.org/jira/browse/SPARK-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5430. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4228 [https://github.com/apache/spark/pull/4228] > Move treeReduce and treeAggregate to core > - > > Key: SPARK-5430 > URL: https://issues.apache.org/jira/browse/SPARK-5430 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.3.0 > > > I've seen many use cases of treeAggregate/treeReduce outside the ML domain. > Maybe it is time to move them to Core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4586) Python API for ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4586. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4151 [https://github.com/apache/spark/pull/4151] > Python API for ML Pipeline > -- > > Key: SPARK-4586 > URL: https://issues.apache.org/jira/browse/SPARK-4586 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.3.0 > > > Add Python API to the newly added ML pipeline and parameters. The initial > design doc is posted here: > https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296037#comment-14296037 ] Sven Krasser edited comment on SPARK-4049 at 1/29/15 12:07 AM: --- -I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). I assume that means that most partitions are replicated twice and some three times?- EDIT: looks like for 2x it will ramp up to 200%, which is consistent with Patrick's comment. That aside, I do not think it's a good idea to count overreplication towards that fraction in this way. As a user, when I see 100% on the UI, then I assume the RDD is fully cached. However, this could also mean that some partitions are missing (and need to be recomputed) and some are overreplicated. was (Author: skrasser): I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). I assume that means that most partitions are replicated twice and some three times? That aside, I do not think it's a good idea to count overreplication towards that fraction in this way. As a user, when I see 100% on the UI, then I assume the RDD is fully cached. However, this could also mean that some partitions are missing (and need to be recomputed) and some are overreplicated. > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5420: --- Priority: Blocker (was: Major) > Cross-langauge load/store functions for creating and saving DataFrames > -- > > Key: SPARK-5420 > URL: https://issues.apache.org/jira/browse/SPARK-5420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Patrick Wendell >Assignee: Yin Huai >Priority: Blocker > > We should have standard API's for loading or saving a table from a data > store. Per comment discussion: > {code} > def loadData(datasource: String, parameters: Map[String, String]): DataFrame > def loadData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > def storeData(datasource: String, parameters: Map[String, String]): DataFrame > def storeData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > {code} > Python should have this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5420: --- Description: We should have standard API's for loading or saving a table from a data store. Per comment discussion: {code} def loadData(datasource: String, parameters: Map[String, String]): DataFrame def loadData(datasource: String, parameters: java.util.Map[String, String]): DataFrame def storeData(datasource: String, parameters: Map[String, String]): DataFrame def storeData(datasource: String, parameters: java.util.Map[String, String]): DataFrame {code} Python should have this too. was: We should have standard API's for loading or saving a table from a data store. Per comment discussion: {code} df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"}) sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"}) {code} > Cross-langauge load/store functions for creating and saving DataFrames > -- > > Key: SPARK-5420 > URL: https://issues.apache.org/jira/browse/SPARK-5420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Patrick Wendell >Assignee: Yin Huai > > We should have standard API's for loading or saving a table from a data > store. Per comment discussion: > {code} > def loadData(datasource: String, parameters: Map[String, String]): DataFrame > def loadData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > def storeData(datasource: String, parameters: Map[String, String]): DataFrame > def storeData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > {code} > Python should have this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5420: --- Description: We should have standard API's for loading or saving a table from a data store. Per comment discussion: {code} df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"}) sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"}) {code} was: We should have standard API's for loading or saving a table from a data store. One idea: {code} df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"}) sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"}) {code} > Cross-langauge load/store functions for creating and saving DataFrames > -- > > Key: SPARK-5420 > URL: https://issues.apache.org/jira/browse/SPARK-5420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Patrick Wendell >Assignee: Yin Huai > > We should have standard API's for loading or saving a table from a data > store. Per comment discussion: > {code} > df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"}) > sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5420: --- Assignee: Yin Huai > Cross-langauge load/store functions for creating and saving DataFrames > -- > > Key: SPARK-5420 > URL: https://issues.apache.org/jira/browse/SPARK-5420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Patrick Wendell >Assignee: Yin Huai > > We should have standard API's for loading or saving a table from a data > store. One idea: > {code} > df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"}) > sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5247) Enable javadoc/scaladoc for public classes in catalyst project
[ https://issues.apache.org/jira/browse/SPARK-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5247: --- Priority: Blocker (was: Major) > Enable javadoc/scaladoc for public classes in catalyst project > -- > > Key: SPARK-5247 > URL: https://issues.apache.org/jira/browse/SPARK-5247 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust >Priority: Blocker > > We previously did not generate any docs for the entire catalyst project. > Since now we are defining public APIs in that (under org.apache.spark.sql > outside of org.apache.spark.sql.catalyst, such as Row, types._), we should > start generating javadoc/scaladoc for those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296085#comment-14296085 ] Apache Spark commented on SPARK-3977: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4256 > Conversions between {Row, Coordinate}Matrix <-> BlockMatrix > --- > > Key: SPARK-3977 > URL: https://issues.apache.org/jira/browse/SPARK-3977 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5469) Break sql.py into multiple files
Reynold Xin created SPARK-5469: -- Summary: Break sql.py into multiple files Key: SPARK-5469 URL: https://issues.apache.org/jira/browse/SPARK-5469 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin It is getting pretty long (2800 loc). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5468) Remove Python LocalHiveContext
Reynold Xin created SPARK-5468: -- Summary: Remove Python LocalHiveContext Key: SPARK-5468 URL: https://issues.apache.org/jira/browse/SPARK-5468 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296068#comment-14296068 ] Marcelo Vanzin commented on SPARK-5388: --- Hi [~andrewor14], I read through the spec, and the protocol specification seems to be lacking some details. The main things that bother me are:
- It's not really a REST API. There's a single endpoint to which you POST different messages. This sort of forces your hand to use a custom implementation, instead of being able to use a much nicer framework for this purpose such as JAX-RS. Using a framework like that can later benefit other parts of Spark too, such as providing a REST API for application data through the web ui / history server. And as I mentioned in the PR, it allows you to define the endpoints using classes or interfaces, which serves two purposes: it allows you to do backwards compatibility checks with tools like MIMA, and it allows you to use the client functionality of JAX-RS for client requests too (and similar tools in other languages, which sort of feeds back into Dale's comment). Plus, you can use things like Jackson and not care about how to parse or generate JSON.
- It's unclear how the protocol will be allowed to evolve. What happens when you add a new field or message in a later version, and that version tries to submit to Spark 1.3? Is there a version negotiation up front, so that the new client knows to use the old protocol if possible, or does the client just send the new message and the server complains if it contains things it doesn't understand? The latter kinda feeds into the first comment. With a proper REST-based API, you'd put the first version of the protocol under "/v1", for example. Later versions are added under "/v2" and can add new things. Client and server can then negotiate up front (e.g., the client needs at least version "x" for the current app, asks the server for its supported versions, and complains if "x" is not there).
Also, it could be more specific about how errors are reported. Do you get specific HTTP error codes for different things? Is there an "Error" type that is sent back to the client in JSON, and if so, what fields does it have? > Provide a stable application submission gateway > --- > > Key: SPARK-5388 > URL: https://issues.apache.org/jira/browse/SPARK-5388 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > Attachments: Stable Spark Standalone Submission.pdf > > > The existing submission gateway in standalone mode is not compatible across > Spark versions. If you have a newer version of Spark submitting to an older > version of the standalone Master, it is currently not guaranteed to work. The > goal is to provide a stable REST interface to replace this channel. > The first cut implementation will target standalone cluster mode because > there are very few messages exchanged. The design, however, will be general > enough to eventually support this for other cluster managers too. Note that > this is not necessarily required in YARN because we already use YARN's stable > interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
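To make the versioning idea concrete, here is a hedged sketch of what a versioned, JAX-RS-style endpoint could look like; it is purely illustrative (the paths, class name, and payload are made up) and is not Spark's actual submission API:
{code}
import javax.ws.rs.{Consumes, POST, Path, Produces}
import javax.ws.rs.core.{MediaType, Response}

// A v1 resource; an incompatible revision would later be mounted side by side under
// "/v2/submissions", letting clients discover the versions the server supports and
// fall back to an older protocol when needed.
@Path("/v1/submissions")
class SubmissionResourceV1 {
  @POST
  @Path("/create")
  @Consumes(Array(MediaType.APPLICATION_JSON))
  @Produces(Array(MediaType.APPLICATION_JSON))
  def create(body: String): Response = {
    // A real implementation would bind `body` to a typed message with Jackson and map
    // validation failures to specific HTTP status codes plus a JSON error entity.
    Response.status(Response.Status.ACCEPTED).entity("""{"success":true}""").build()
  }
}
{code}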
[jira] [Commented] (SPARK-5445) Make sure DataFrame expressions are usable in Java
[ https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296064#comment-14296064 ] Apache Spark commented on SPARK-5445: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4241 > Make sure DataFrame expressions are usable in Java > -- > > Key: SPARK-5445 > URL: https://issues.apache.org/jira/browse/SPARK-5445 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Some DataFrame expressions are not exactly usable in Java. For example, > aggregate functions are only defined in the dsl package object, which is > painful to use. Another example is operator overloading, which would require > Java users to use $plus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
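As an aside, a hedged sketch of the kind of Java-friendly alias being discussed (the wrapper class below is illustrative, not the actual DataFrame Column API): pairing each symbolic operator with a plain method name keeps the Scala DSL intact while sparing Java users from calling $plus directly.
{code}
import org.apache.spark.sql.catalyst.expressions.{Add, Expression}

// Illustrative column-like wrapper: the operator delegates to a named method,
// so Java callers can write col.plus(other) instead of col.$plus(other).
class JColumn(val expr: Expression) {
  def +(other: JColumn): JColumn = plus(other)
  def plus(other: JColumn): JColumn = new JColumn(Add(expr, other.expr))
}
{code}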
[jira] [Resolved] (SPARK-5448) Make CacheManager a concrete class and field in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5448. Resolution: Fixed Fix Version/s: 1.3.0 > Make CacheManager a concrete class and field in SQLContext > -- > > Key: SPARK-5448 > URL: https://issues.apache.org/jira/browse/SPARK-5448 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > So we don't have to include it using trait mixin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5447) Replace reference to SchemaRDD with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5447. Resolution: Fixed Fix Version/s: 1.3.0 > Replace reference to SchemaRDD with DataFrame > - > > Key: SPARK-5447 > URL: https://issues.apache.org/jira/browse/SPARK-5447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > We renamed SchemaRDD -> DataFrame, but internally various code still > reference SchemaRDD in JavaDoc and comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)
[ https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-5467. - Resolution: Duplicate > DStreams should provide windowing based on timestamps from the data (as > opposed to wall clock time) > --- > > Key: SPARK-5467 > URL: https://issues.apache.org/jira/browse/SPARK-5467 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Imran Rashid > > DStreams currently only let you window based on wall clock time. This > doesn't work very well when you're loading historical logs that are already > sitting around, because they'll all go into one window. DStreams should > provide a way for you to window based on a field of the incoming data. This > would be useful if you want to either (1) bootstrap a streaming app from some > logs or (2) test out the behavior of your app on historical logs, eg. for > correctness or performance. > I think there are some open questions here, such as whether the input data > sources need to be sorted by time, how batches get triggered etc., but it > seems like an important use case. > This just came up on the mailing list: > http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html > And I think it is also what was this Jira was getting at: > https://issues.apache.org/jira/browse/SPARK-4427 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)
[ https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296047#comment-14296047 ] Imran Rashid commented on SPARK-5467: - Shoot, sorry I missed that JIRA. (I swear I tried searching, no idea how I missed it.) It actually seems to be proposing something significantly more involved, by keeping all the bins open to receive more events, but I suppose it's close enough to a duplicate. I'll close this. > DStreams should provide windowing based on timestamps from the data (as > opposed to wall clock time) > --- > > Key: SPARK-5467 > URL: https://issues.apache.org/jira/browse/SPARK-5467 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Imran Rashid > > DStreams currently only let you window based on wall clock time. This > doesn't work very well when you're loading historical logs that are already > sitting around, because they'll all go into one window. DStreams should > provide a way for you to window based on a field of the incoming data. This > would be useful if you want to either (1) bootstrap a streaming app from some > logs or (2) test out the behavior of your app on historical logs, eg. for > correctness or performance. > I think there are some open questions here, such as whether the input data > sources need to be sorted by time, how batches get triggered etc., but it > seems like an important use case. > This just came up on the mailing list: > http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html > And I think it is also what was this Jira was getting at: > https://issues.apache.org/jira/browse/SPARK-4427 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296037#comment-14296037 ] Sven Krasser commented on SPARK-4049: - I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). I assume that means that most partitions are replicated twice and some three times? That aside, I do not think it's a good idea to count overreplication towards that fraction in this way. As a user, when I see 100% on the UI, then I assume the RDD is fully cached. However, this could also mean that some partitions are missing (and need to be recomputed) and some are overreplicated. > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)
[ https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296032#comment-14296032 ] Sean Owen commented on SPARK-5467: -- Is this about the same as SPARK-4392? > DStreams should provide windowing based on timestamps from the data (as > opposed to wall clock time) > --- > > Key: SPARK-5467 > URL: https://issues.apache.org/jira/browse/SPARK-5467 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Imran Rashid > > DStreams currently only let you window based on wall clock time. This > doesn't work very well when you're loading historical logs that are already > sitting around, because they'll all go into one window. DStreams should > provide a way for you to window based on a field of the incoming data. This > would be useful if you want to either (1) bootstrap a streaming app from some > logs or (2) test out the behavior of your app on historical logs, eg. for > correctness or performance. > I think there are some open questions here, such as whether the input data > sources need to be sorted by time, how batches get triggered etc., but it > seems like an important use case. > This just came up on the mailing list: > http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html > And I think it is also what was this Jira was getting at: > https://issues.apache.org/jira/browse/SPARK-4427 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296031#comment-14296031 ] Sean Owen commented on SPARK-5466: -- I see this too from a completely clean build. > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0 >Reporter: Jian Zhou > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build error in multiple components, including Graph/MLLib/SQL, > when package com.google.common on the classpath incompatible with the version > used when compiling Utils.class > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)
Imran Rashid created SPARK-5467: --- Summary: DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time) Key: SPARK-5467 URL: https://issues.apache.org/jira/browse/SPARK-5467 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Imran Rashid DStreams currently only let you window based on wall clock time. This doesn't work very well when you're loading historical logs that are already sitting around, because they'll all go into one window. DStreams should provide a way for you to window based on a field of the incoming data. This would be useful if you want to either (1) bootstrap a streaming app from some logs or (2) test out the behavior of your app on historical logs, e.g. for correctness or performance. I think there are some open questions here, such as whether the input data sources need to be sorted by time, how batches get triggered, etc., but it seems like an important use case. This just came up on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html And I think it is also what this JIRA was getting at: https://issues.apache.org/jira/browse/SPARK-4427 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
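In the meantime, one rough workaround on today's API is to key records by a window derived from their own timestamp field rather than by arrival time. A hedged sketch under assumed names (LogEvent and bucketMillis are made up; each batch only produces partial per-bucket counts, so a full solution would still need something like updateStateByKey to merge buckets across batches):
{code}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

case class LogEvent(timestampMs: Long, key: String, value: Long)

// Group by (event-time bucket, key) instead of by the batch's wall-clock window.
def countsByEventTimeBucket(events: DStream[LogEvent],
                            bucketMillis: Long): DStream[((Long, String), Long)] = {
  events
    .map(e => ((e.timestampMs / bucketMillis, e.key), e.value))
    .reduceByKey(_ + _)  // partial sums per bucket within each micro-batch
}
{code}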
[jira] [Created] (SPARK-5466) Build Error caused by Guava shading in Spark
Jian Zhou created SPARK-5466: Summary: Build Error caused by Guava shading in Spark Key: SPARK-5466 URL: https://issues.apache.org/jira/browse/SPARK-5466 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Jian Zhou Guava is shaded inside spark-core itself. https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 This causes a build error in multiple components, including GraphX/MLlib/SQL, when the com.google.common package on the classpath is incompatible with the version used when compiling Utils.class: [error] bad symbolic reference. A signature in Utils.class refers to term util [error] in package com.google.common which is not available. [error] It may be completely missing from the current classpath, or the version on [error] the classpath might be incompatible with the version used when compiling Utils.class. [error] [error] while compiling: /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala [error] during phase: erasure [error] library version: version 2.10.4 [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5465) Data source version of Parquet doesn't push down And filters properly
[ https://issues.apache.org/jira/browse/SPARK-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295951#comment-14295951 ] Apache Spark commented on SPARK-5465: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4255 > Data source version of Parquet doesn't push down And filters properly > - > > Key: SPARK-5465 > URL: https://issues.apache.org/jira/browse/SPARK-5465 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.2.1 >Reporter: Cheng Lian >Priority: Blocker > > The current implementation combines all predicates and then tries to convert > it to a single Parquet filter predicate. In this way, the Parquet filter > predicate can not be generated if any component of the original filters can > not be converted. (code lines > [here|https://github.com/apache/spark/blob/a731314c319a6f265060e05267844069027804fd/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L197-L201]). > For example, {{a > 10 AND a < 20}} can be successfully converted, while {{a > > 10 AND a < b}} can't because Parquet doesn't accept filters like {{a < b}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5441) SerDeUtil Pair RDD to python conversion doesn't accept empty RDDs
[ https://issues.apache.org/jira/browse/SPARK-5441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5441: -- Target Version/s: 1.3.0, 1.1.2, 1.2.2 (was: 1.3.0) Fix Version/s: 1.3.0 Labels: backport-needed (was: ) I've merged https://github.com/apache/spark/pull/4236, which fixes this, into 1.3.0 and I'll come back later to backport this to 1.2.2 and 1.1.2. > SerDeUtil Pair RDD to python conversion doesn't accept empty RDDs > - > > Key: SPARK-5441 > URL: https://issues.apache.org/jira/browse/SPARK-5441 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.1, 1.2.0 >Reporter: Michael Nazario >Assignee: Michael Nazario > Labels: backport-needed > Fix For: 1.3.0 > > > SerDeUtil.pairRDDToPython and SerDeUtil.pythonToPairRDD rely on rdd.first() > which throws an exception if the RDD is empty. We should be able to handle > the empty RDD case because this doesn't prevent a valid RDD from being > created. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
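The failure mode in the quoted description comes from calling first() on a possibly empty RDD. A small, hedged sketch of the guard the fix implies (names are illustrative, not the actual SerDeUtil code):
{code}
import org.apache.spark.rdd.RDD

// Peek at the first element only if one exists; rdd.first() throws on an empty RDD,
// while take(1) simply returns an empty array.
def firstElementIfAny[T](rdd: RDD[T]): Option[T] =
  rdd.take(1).headOption
{code}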
[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295930#comment-14295930 ] Andrew Musselman commented on SPARK-4259: - So this feature won't be doing spectral clustering, and will be switching to the power iteration method? Should I create another ticket for spectral clustering if so? > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, power Iteration clustering has become one of the most > popular modern clustering algorithms. It is simple to implement, can be > solved efficiently by standard linear algebra software, and very often > outperforms traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. Internally the > algorithm: > computes the Gaussian distance between all pairs of points and represents > these distances in an Affinity Matrix > calculates a Normalized Affinity Matrix > calculates the principal eigenvalue and eigenvector > Clusters each of the input points according to their principal eigenvector > component value > Details of this algorithm are found within [Power Iteration Clustering, Lin > and Cohen]{www.icml2010.org/papers/387.pdf} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4387) Refactoring python profiling code to make it extensible
[ https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4387: -- Assignee: Yandu Oppacher > Refactoring python profiling code to make it extensible > --- > > Key: SPARK-4387 > URL: https://issues.apache.org/jira/browse/SPARK-4387 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Yandu Oppacher >Assignee: Yandu Oppacher > Fix For: 1.3.0 > > > SPARK-3478 introduced python profiling for workers which is great but it > would be nice to be able to change the profiler and output formats as needed. > This is a refactoring of the code to allow that to happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4387) Refactoring python profiling code to make it extensible
[ https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4387. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3901 [https://github.com/apache/spark/pull/3901] > Refactoring python profiling code to make it extensible > --- > > Key: SPARK-4387 > URL: https://issues.apache.org/jira/browse/SPARK-4387 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Yandu Oppacher >Assignee: Yandu Oppacher > Fix For: 1.3.0 > > > SPARK-3478 introduced python profiling for workers which is great but it > would be nice to be able to change the profiler and output formats as needed. > This is a refactoring of the code to allow that to happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295912#comment-14295912 ] Apache Spark commented on SPARK-4259: - User 'fjiang6' has created a pull request for this issue: https://github.com/apache/spark/pull/4254 > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, power Iteration clustering has become one of the most > popular modern clustering algorithms. It is simple to implement, can be > solved efficiently by standard linear algebra software, and very often > outperforms traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. Internally the > algorithm: > computes the Gaussian distance between all pairs of points and represents > these distances in an Affinity Matrix > calculates a Normalized Affinity Matrix > calculates the principal eigenvalue and eigenvector > Clusters each of the input points according to their principal eigenvector > component value > Details of this algorithm are found within [Power Iteration Clustering, Lin > and Cohen]{www.icml2010.org/papers/387.pdf} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5465) Data source version of Parquet doesn't push down And filters properly
Cheng Lian created SPARK-5465: - Summary: Data source version of Parquet doesn't push down And filters properly Key: SPARK-5465 URL: https://issues.apache.org/jira/browse/SPARK-5465 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cheng Lian Priority: Blocker The current implementation combines all predicates and then tries to convert it to a single Parquet filter predicate. In this way, the Parquet filter predicate can not be generated if any component of the original filters can not be converted. (code lines [here|https://github.com/apache/spark/blob/a731314c319a6f265060e05267844069027804fd/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L197-L201]). For example, {{a > 10 AND a < 20}} can be successfully converted, while {{a > 10 AND a < b}} can't because Parquet doesn't accept filters like {{a < b}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
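A hedged sketch of the per-conjunct strategy implied above, assuming a helper {{convert}} that returns None for predicates Parquet cannot express (such as {{a < b}}); each convertible conjunct is pushed down on its own and the results are AND-ed together, while anything unconvertible is left for Spark to evaluate after the scan:
{code}
import org.apache.spark.sql.catalyst.expressions.Expression
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Combine only the conjuncts that survive conversion, instead of giving up on the
// whole filter when a single conjunct cannot be translated.
def pushableParquetFilter(conjuncts: Seq[Expression],
                          convert: Expression => Option[FilterPredicate]): Option[FilterPredicate] =
  conjuncts.flatMap(convert).reduceOption(FilterApi.and(_, _))
{code}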
[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, power Iteration clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: computes the Gaussian distance between all pairs of points and represents these distances in an Affinity Matrix calculates a Normalized Affinity Matrix calculates the principal eigenvalue and eigenvector Clusters each of the input points according to their principal eigenvector component value Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf} was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: computes the Gaussian distance between all pairs of points and represents these distances in an Affinity Matrix calculates a Normalized Affinity Matrix calculates the principal eigenvalue and eigenvector Clusters each of the input points according to their principal eigenvector component value Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf} > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, power Iteration clustering has become one of the most > popular modern clustering algorithms. It is simple to implement, can be > solved efficiently by standard linear algebra software, and very often > outperforms traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. Internally the > algorithm: > computes the Gaussian distance between all pairs of points and represents > these distances in an Affinity Matrix > calculates a Normalized Affinity Matrix > calculates the principal eigenvalue and eigenvector > Clusters each of the input points according to their principal eigenvector > component value > Details of this algorithm are found within [Power Iteration Clustering, Lin > and Cohen]{www.icml2010.org/papers/387.pdf} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
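A toy, single-machine sketch of the power iteration step described above, assuming a dense, row-normalized affinity matrix (each row of {{w}} sums to 1); the real implementation would work on a distributed graph, and the resulting one-dimensional embedding would then be clustered, e.g. with k-means:
{code}
// Repeatedly apply v <- W v and re-normalize; v converges towards the dominant
// eigenvector of the normalized affinity matrix.
def powerIteration(w: Array[Array[Double]], iterations: Int): Array[Double] = {
  val n = w.length
  var v = Array.fill(n)(1.0 / n)
  for (_ <- 1 to iterations) {
    val next = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val norm = next.map(math.abs).sum
    v = next.map(_ / norm)
  }
  v
}
{code}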
[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: computes the Gaussian distance between all pairs of points and represents these distances in an Affinity Matrix calculates a Normalized Affinity Matrix calculates the principal eigenvalue and eigenvector Clusters each of the input points according to their principal eigenvector component value Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf} was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. Summary: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function (was: Add Spectral Clustering Algorithm with Gaussian Similarity Function) > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, spectral clustering has become one of the most popular > modern clustering algorithms. It is simple to implement, can be solved > efficiently by standard linear algebra software, and very often outperforms > traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. 
Internally the > algorithm: > computes the Gaussian distance between all pairs of points and represents > these distances in an Affinity Matrix > calculates a Normalized Affinity Matrix > calculates the principal eigenvalue and eigenvector > Clusters each of the input points according to their principal eigenvector > component value > Details of this algorithm are found within [Power Iteration Clustering, Lin > and Cohen]{www.icml2010.org/papers/387.pdf} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295902#comment-14295902 ] Cheng Lian commented on SPARK-5463: --- SPARK-4258 is fixed in Parquet master. SPARK-5451 is fixed by [Parquet PR #108|https://github.com/apache/incubator-parquet-mr/pull/108]. Once Parquet PR #108 is merged and a new (RC) release is cut, the first 2 sub-tasks can be resolved. > Fix Parquet filter push-down > > > Key: SPARK-5463 > URL: https://issues.apache.org/jira/browse/SPARK-5463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2 >Reporter: Cheng Lian >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5346: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-5463 > Parquet filter pushdown is not enabled when parquet.task.side.metadata is set > to true (default value) > - > > Key: SPARK-5346 > URL: https://issues.apache.org/jira/browse/SPARK-5346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Cheng Lian >Priority: Blocker > > When computing Parquet splits, reading Parquet metadata from the executor side is > more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to > {{true}} by > default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. > However, somehow this disables filter pushdown. > To work around this issue and enable Parquet filter pushdown, users can set > {{spark.sql.parquet.filterPushdown}} to {{true}} and > {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files > with a large number of part-files and/or columns, reading metadata from the > driver side eats lots of memory. > The following Spark shell snippet can be useful to reproduce this issue: > {code} > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > import sqlContext._ > case class KeyValue(key: Int, value: String) > sc. > parallelize(1 to 1024). > flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))). > saveAsParquetFile("large.parquet") > parquetFile("large.parquet").registerTempTable("large") > sql("SET spark.sql.parquet.filterPushdown=true") > sql("SELECT * FROM large").collect() > sql("SELECT * FROM large WHERE key < 200").collect() > {code} > Users can verify this issue by checking the input size metrics from the web UI. > When filter pushdown is enabled, the second query reads less data. > Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ > configuration (either via {{core-site.xml}} or > {{SparkContext.hadoopConfiguration.set()}}); setting it in > {{spark-defaults.conf}} or via {{SparkConf}} does NOT work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
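Following the repro snippet quoted above, a spark-shell sketch of the workaround (illustrative only; note that the Parquet property has to go into the Hadoop configuration, not into SparkConf or spark-defaults.conf):
{code}
// Continuing the spark-shell session from the description above.
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")
sql("SET spark.sql.parquet.filterPushdown=true")
// With both settings in place, the filtered query should show a much smaller
// input size in the web UI than the unfiltered one.
sql("SELECT * FROM large WHERE key < 200").collect()
{code}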
[jira] [Created] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error
Josh Rosen created SPARK-5464: - Summary: Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error Key: SPARK-5464 URL: https://issues.apache.org/jira/browse/SPARK-5464 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Josh Rosen Priority: Blocker Trying to call {{help()}} on a Python DataFrame fails with an exception: {code} >>> help(df) Traceback (most recent call last): File "", line 1, in File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in __call__ return pydoc.help(*args, **kwds) File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in __call__ self.help(request) File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help else: doc(request, 'Help on %s:') File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc pager(render_doc(thing, title, forceload)) File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in render_doc object, name = resolve(thing, forceload) File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in resolve name = getattr(thing, '__name__', None) File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, in __getattr__ return Column(self._jdf.apply(name)) File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply. : java.lang.RuntimeException: Cannot resolve column name "__name__" at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} Here's a reproduction: {code} >>> from pyspark.sql import SQLContext, Row >>> sqlContext = SQLContext(sc) >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) >>> df = sqlContext.jsonRDD(rdd) >>> help(df) {code} I think the problem here is that we don't throw the expected exception from our overloaded {{getattr}} if a column can't be found. We should be able to fix this by only attempting to call {{apply}} after checking that the column name is valid (e.g. check against {{columns}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5451) And predicates are not properly pushed down
[ https://issues.apache.org/jira/browse/SPARK-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5451: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-5463 > And predicates are not properly pushed down > --- > > Key: SPARK-5451 > URL: https://issues.apache.org/jira/browse/SPARK-5451 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.2.1 >Reporter: Cheng Lian >Priority: Critical > > This issue is actually caused by PARQUET-173. > The following {{spark-shell}} session can be used to reproduce this bug: > {code} > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > import sc._ > import sqlContext._ > case class KeyValue(key: Int, value: String) > parallelize(1 to 1024 * 1024 * 20). > flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))). > saveAsParquetFile("large.parquet") > parquetFile("large.parquet").registerTempTable("large") > hadoopConfiguration.set("parquet.task.side.metadata", "false") > sql("SET spark.sql.parquet.filterPushdown=true") > sql("SELECT value FROM large WHERE 1024 < value AND value < 2048").collect() > {code} > From the log we can find: > {code} > There were no row groups that could be dropped due to filter predicates > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4258: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-5463 > NPE with new Parquet Filters > > > Key: SPARK-4258 > URL: https://issues.apache.org/jira/browse/SPARK-4258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.2.0 > > > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): > java.lang.NullPointerException: > parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) > parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) > parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) > parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) > > parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) > parquet.filter2.predicate.Operators$And.accept(Operators.java:290) > > parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) > parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) > parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) > > parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) > > parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) > > parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) > > parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) > {code} > This occurs when reading parquet data encoded with the older version of the > library for TPC-DS query 34. Will work on coming up with a smaller > reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5463) Fix Parquet filter push-down
Cheng Lian created SPARK-5463: - Summary: Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2 Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame
Josh Rosen created SPARK-5462: - Summary: Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame Key: SPARK-5462 URL: https://issues.apache.org/jira/browse/SPARK-5462 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Reporter: Josh Rosen Priority: Blocker When trying to access fields on a Python DataFrame created via inferSchema, I ran into a confusing Catalyst Py4J error. Here's a reproduction: {code} from pyspark import SparkContext from pyspark.sql import SQLContext, Row sc = SparkContext("local", "test") sqlContext = SQLContext(sc) # Load a text file and convert each line to a Row. lines = sc.textFile("examples/src/main/resources/people.txt") parts = lines.map(lambda l: l.split(",")) people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) # Infer the schema, and register the SchemaRDD as a table. schemaPeople = sqlContext.inferSchema(people) schemaPeople.registerTempTable("people") # SQL can be run over SchemaRDDs that have been registered as a table. teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") print teenagers.name {code} This fails with the following error: {code} Traceback (most recent call last): File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in print teenagers.name File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, in __getattr__ return Column(self._jdf.apply(name)) File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. 
: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'name at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} This is distinct from the helpful error message that I get when trying to access a non-existent column. This error didn't occur when I tried the same thing with a DataFrame created via jsonRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4989) wrong application configuration brings the cluster down in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4989: - Fix Version/s: 1.1.2 > wrong application configuration brings the cluster down in standalone mode > - > > Key: SPARK-4989 > URL: https://issues.apache.org/jira/browse/SPARK-4989 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Critical > Fix For: 1.3.0, 1.1.2 > > > When enabling the event log in standalone mode, a wrong configuration can bring > the standalone cluster down (the Master restarts and loses its connections to the > Workers). > How to reproduce: just give an invalid value to "spark.eventLog.dir", for > example: *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This > will throw an IllegalArgumentException, which will cause the *Master* to restart > and leave the whole cluster unavailable. > It is not acceptable that the cluster can be crashed by one application's wrong > setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4989) wrong application configuration brings the cluster down in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4989: - Target Version/s: 1.3.0, 1.1.2, 1.2.2 (was: 1.3.0) > wrong application configuration brings the cluster down in standalone mode > - > > Key: SPARK-4989 > URL: https://issues.apache.org/jira/browse/SPARK-4989 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Critical > Fix For: 1.3.0, 1.1.2 > > > When enabling the event log in standalone mode, a wrong configuration can bring > the standalone cluster down (the Master restarts and loses its connections to the > Workers). > How to reproduce: just give an invalid value to "spark.eventLog.dir", for > example: *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This > will throw an IllegalArgumentException, which will cause the *Master* to restart > and leave the whole cluster unavailable. > It is not acceptable that the cluster can be crashed by one application's wrong > setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4989) wrong application configuration brings the cluster down in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-4989: -- > wrong application configuration brings the cluster down in standalone mode > - > > Key: SPARK-4989 > URL: https://issues.apache.org/jira/browse/SPARK-4989 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Critical > Fix For: 1.3.0, 1.1.2 > > > When enabling the event log in standalone mode, a wrong configuration can bring > the standalone cluster down (the Master restarts and loses its connections to the > Workers). > How to reproduce: just give an invalid value to "spark.eventLog.dir", for > example: *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This > will throw an IllegalArgumentException, which will cause the *Master* to restart > and leave the whole cluster unavailable. > It is not acceptable that the cluster can be crashed by one application's wrong > setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5417) Remove redundant executor-ID set() call
[ https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5417: - Target Version/s: 1.3.0, 1.2.1 Fix Version/s: 1.3.0 Assignee: Ryan Williams Labels: backport-needed (was: ) > Remove redundant executor-ID set() call > --- > > Key: SPARK-5417 > URL: https://issues.apache.org/jira/browse/SPARK-5417 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams >Assignee: Ryan Williams >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > {{spark.executor.id}} no longer [needs to be set in > Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], > as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream > in > [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. > Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5434) Preserve spaces in path to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5434: - Target Version/s: 1.3.0, 1.2.1 Fix Version/s: 1.3.0 Labels: backport-needed (was: ) > Preserve spaces in path to spark-ec2 > > > Key: SPARK-5434 > URL: https://issues.apache.org/jira/browse/SPARK-5434 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.2.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > If the path to {{spark-ec2}} contains spaces, the script won't run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5434) Preserve spaces in path to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5434: - Assignee: Nicholas Chammas > Preserve spaces in path to spark-ec2 > > > Key: SPARK-5434 > URL: https://issues.apache.org/jira/browse/SPARK-5434 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.2.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > If the path to {{spark-ec2}} contains spaces, the script won't run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4955. Resolution: Fixed Fix Version/s: 1.3.0 > Dynamic allocation doesn't work in YARN cluster mode > > > Key: SPARK-4955 > URL: https://issues.apache.org/jira/browse/SPARK-4955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Chengxiang Li >Assignee: Lianhui Wang >Priority: Blocker > Fix For: 1.3.0 > > > With dynamic executor allocation enabled in yarn-cluster mode, after a query finishes > and the spark.dynamicAllocation.executorIdleTimeout interval elapses, the number of > executors is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
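For context, a sketch of the configuration involved in the report above (property names as documented for 1.2; the values are illustrative assumptions, not settings taken from this ticket):
{code}
import org.apache.spark.SparkConf

// Illustrative dynamic-allocation settings for a yarn-cluster application.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60") // seconds
  .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation
{code}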
[jira] [Updated] (SPARK-5440) Add toLocalIterator to pyspark rdd
[ https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5440: -- Assignee: Michael Nazario > Add toLocalIterator to pyspark rdd > -- > > Key: SPARK-5440 > URL: https://issues.apache.org/jira/browse/SPARK-5440 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Michael Nazario >Assignee: Michael Nazario > Fix For: 1.3.0 > > > toLocalIterator is available in Java and Scala. If we add this functionality > to Python, then we will also be able to use PySpark to iterate over a dataset > partition by partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5440) Add toLocalIterator to pyspark rdd
[ https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5440. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4237 [https://github.com/apache/spark/pull/4237] > Add toLocalIterator to pyspark rdd > -- > > Key: SPARK-5440 > URL: https://issues.apache.org/jira/browse/SPARK-5440 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Michael Nazario > Fix For: 1.3.0 > > > toLocalIterator is available in Java and Scala. If we add this functionality > to Python, then we will also be able to use PySpark to iterate over a dataset > partition by partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1934) "this" reference escape to "selectorThread" during construction in ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1934. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen > "this" reference escape to "selectorThread" during construction in > ConnectionManager > > > Key: SPARK-1934 > URL: https://issues.apache.org/jira/browse/SPARK-1934 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Sean Owen >Priority: Minor > Fix For: 1.3.0 > > > `selectorThread` is started during the construction of > `org.apache.spark.network.ConnectionManager`, which may cause > `writeRunnableStarted` and `readRunnableStarted` to be uninitialized before > they are used. > Indirectly, `BlockManager.this` also escapes, since it calls `new > ConnectionManager(...)` and is later used by some of `ConnectionManager`'s > threads. Those threads may see an uninitialized `BlockManager`. > In summary, such escapes are dangerous and make the concurrency hard to > reason about; they should be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
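The general hazard described above can be shown with a small generic Scala sketch (not Spark's actual code): a thread started from a constructor can observe fields before they are initialized, and one common fix is to finish construction first and start the thread afterwards, for example from a factory method.
{code}
// Problematic: "this" escapes to the thread during construction,
// so run() may observe `started` before it is initialized.
class Unsafe {
  private val worker = new Thread(new Runnable {
    def run(): Unit = println(s"started = $started") // may print the default value
  })
  worker.start() // starts while construction is still in progress
  private val started = true
}

// Safer: construct fully, then start the thread from a factory method.
class Safe private () {
  private val started = true
  private val worker = new Thread(new Runnable {
    def run(): Unit = println(s"started = $started") // always sees true
  })
  private def launch(): Unit = worker.start()
}

object Safe {
  def apply(): Safe = {
    val s = new Safe() // construction completes before the thread starts
    s.launch()
    s
  }
}
{code}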
[jira] [Updated] (SPARK-5188) make-distribution.sh should support curl, not only wget to get Tachyon
[ https://issues.apache.org/jira/browse/SPARK-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5188: --- Fix Version/s: 1.3.0 > make-distribution.sh should support curl, not only wget to get Tachyon > -- > > Key: SPARK-5188 > URL: https://issues.apache.org/jira/browse/SPARK-5188 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon will > be downloaded with the `wget` command, but some systems don't have `wget` by default > (Mac OS X doesn't). > Other scripts like build/mvn and build/sbt support not only `wget` but also > `curl`, so `make-distribution.sh` should support `curl` too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5188) make-distribution.sh should support curl, not only wget to get Tachyon
[ https://issues.apache.org/jira/browse/SPARK-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5188. Resolution: Fixed Assignee: Kousuke Saruta > make-distribution.sh should support curl, not only wget to get Tachyon > -- > > Key: SPARK-5188 > URL: https://issues.apache.org/jira/browse/SPARK-5188 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > > When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon will > be downloaded with the `wget` command, but some systems don't have `wget` by default > (Mac OS X doesn't). > Other scripts like build/mvn and build/sbt support not only `wget` but also > `curl`, so `make-distribution.sh` should support `curl` too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5458) Refer to aggregateByKey instead of combineByKey in docs
[ https://issues.apache.org/jira/browse/SPARK-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5458. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sandy Ryza > Refer to aggregateByKey instead of combineByKey in docs > --- > > Key: SPARK-5458 > URL: https://issues.apache.org/jira/browse/SPARK-5458 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Trivial > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
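For readers following the doc change above, a small spark-shell sketch contrasting the two APIs (illustrative; {{aggregateByKey}} covers the common aggregation case that the docs previously illustrated with {{combineByKey}}):
{code}
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// aggregateByKey: a zero value plus a within-partition and an across-partition function.
val sums = pairs.aggregateByKey(0)(_ + _, _ + _)
sums.collect() // e.g. Array((a,4), (b,2))

// The equivalent combineByKey call is more verbose for this simple case.
val sums2 = pairs.combineByKey(
  (v: Int) => v,
  (acc: Int, v: Int) => acc + v,
  (a: Int, b: Int) => a + b)
{code}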
[jira] [Updated] (SPARK-5440) Add toLocalIterator to pyspark rdd
[ https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5440: -- Affects Version/s: (was: 1.2.0) I'm removing the "Affects Version(s)" field from this since it isn't a bug. > Add toLocalIterator to pyspark rdd > -- > > Key: SPARK-5440 > URL: https://issues.apache.org/jira/browse/SPARK-5440 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Michael Nazario > > toLocalIterator is available in Java and Scala. If we add this functionality > to Python, then we will also be able to use PySpark to iterate over a dataset > partition by partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods
[ https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295817#comment-14295817 ] Apache Spark commented on SPARK-5461: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4253 > Graph should have isCheckpointed, getCheckpointFiles methods > > > Key: SPARK-5461 > URL: https://issues.apache.org/jira/browse/SPARK-5461 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Graph has a checkpoint method but does not have other helper functionality > which RDD has. Proposal: > {code} > /** >* Return whether this Graph has been checkpointed or not >*/ > def isCheckpointed: Boolean > /** >* Gets the name of the files to which this Graph was checkpointed >*/ > def getCheckpointFiles: Seq[String] > {code} > I need this for [SPARK-1405]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5288) Stabilize Spark SQL data type API followup
[ https://issues.apache.org/jira/browse/SPARK-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295771#comment-14295771 ] Reynold Xin commented on SPARK-5288: Alright maybe we can expose NumericType as well, but hide the other non-leaf ones. > Stabilize Spark SQL data type API followup > --- > > Key: SPARK-5288 > URL: https://issues.apache.org/jira/browse/SPARK-5288 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Several issues we need to address before release 1.3 > * Do we want to make all classes in > org.apache.spark.sql.types.dataTypes.scala public? Seems we do not need to > make those abstract classes public. > * Seems NativeType is not a very clear and useful concept. Should we just > remove it? > * We need to Stabilize the type hierarchy of our data types. Seems StringType > and Decimal Type should not be primitive types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5288) Stabilize Spark SQL data type API followup
[ https://issues.apache.org/jira/browse/SPARK-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295765#comment-14295765 ] Xiangrui Meng commented on SPARK-5288: -- +1 on the use case [~prudenko] mentioned. [~rxin] If we only keep leaf types, we should provide methods to validate a group of types, e.g., whether a type could be cast to Double. > Stabilize Spark SQL data type API followup > --- > > Key: SPARK-5288 > URL: https://issues.apache.org/jira/browse/SPARK-5288 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Several issues we need to address before release 1.3 > * Do we want to make all classes in > org.apache.spark.sql.types.dataTypes.scala public? Seems we do not need to > make those abstract classes public. > * Seems NativeType is not a very clear and useful concept. Should we just > remove it? > * We need to Stabilize the type hierarchy of our data types. Seems StringType > and Decimal Type should not be primitive types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
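A sketch of the kind of validation helper suggested above, written only against leaf types from {{org.apache.spark.sql.types}} (the package named in this ticket). Whether Spark should ship such a helper, and which abstract parents stay public, is exactly what is being discussed, so treat this purely as an illustration:
{code}
import org.apache.spark.sql.types._

// Could a value of this type be numerically cast to Double?
// Matches only on leaf types, so it does not rely on the non-public
// abstract parents (NumericType etc.) being exposed.
def widensToDouble(dt: DataType): Boolean = dt match {
  case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => true
  case _: DecimalType => true
  case _ => false
}

widensToDouble(IntegerType) // true
widensToDouble(StringType)  // false
{code}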
[jira] [Commented] (SPARK-3996) Shade Jetty in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295716#comment-14295716 ] Apache Spark commented on SPARK-3996: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/4252 > Shade Jetty in Spark deliverables > - > > Key: SPARK-3996 > URL: https://issues.apache.org/jira/browse/SPARK-3996 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Mingyu Kim >Assignee: Matthew Cheah > > We'd like to use Spark in a Jetty 9 server, and it's causing a version > conflict. Given that Spark's dependency on Jetty is light, it'd be a good > idea to shade this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods
[ https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-5461: Assignee: Joseph K. Bradley > Graph should have isCheckpointed, getCheckpointFiles methods > > > Key: SPARK-5461 > URL: https://issues.apache.org/jira/browse/SPARK-5461 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Graph has a checkpoint method but does not have other helper functionality > which RDD has. Proposal: > {code} > /** >* Return whether this Graph has been checkpointed or not >*/ > def isCheckpointed: Boolean > /** >* Gets the name of the files to which this Graph was checkpointed >*/ > def getCheckpointFiles: Seq[String] > {code} > I need this for [SPARK-1405]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods
Joseph K. Bradley created SPARK-5461: Summary: Graph should have isCheckpointed, getCheckpointFiles methods Key: SPARK-5461 URL: https://issues.apache.org/jira/browse/SPARK-5461 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Graph has a checkpoint method but does not have other helper functionality which RDD has. Proposal: {code} /** * Return whether this Graph has been checkpointed or not */ def isCheckpointed: Boolean /** * Gets the name of the files to which this Graph was checkpointed */ def getCheckpointFiles: Seq[String] {code} I need this for [SPARK-1405]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
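One way the proposal above could play out, sketched here as a standalone helper over the public Graph API rather than as methods on Graph itself: a graph is backed by a vertex RDD and an edge RDD, both of which already expose the RDD checkpoint helpers. The object and method placement below are assumptions for illustration; the actual API shape is up to the implementation.
{code}
import org.apache.spark.graphx.Graph

object GraphCheckpointUtil {
  // A graph counts as checkpointed once both of its underlying RDDs are.
  def isCheckpointed[VD, ED](g: Graph[VD, ED]): Boolean =
    g.vertices.isCheckpointed && g.edges.isCheckpointed

  // Checkpoint files of the underlying RDDs, if any have been written.
  def getCheckpointFiles[VD, ED](g: Graph[VD, ED]): Seq[String] =
    Seq(g.vertices.getCheckpointFile, g.edges.getCheckpointFile).flatten
}
{code}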
[jira] [Created] (SPARK-5460) RandomForest should catch exceptions when removing checkpoint files
Joseph K. Bradley created SPARK-5460: Summary: RandomForest should catch exceptions when removing checkpoint files Key: SPARK-5460 URL: https://issues.apache.org/jira/browse/SPARK-5460 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor RandomForest can optionally use checkpointing. When it tries to remove checkpoint files, it could fail (if a user has write but not delete access on some filesystem). There should be a try-catch to catch exceptions when trying to remove checkpoint files in NodeIdCache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
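The change being asked for above is essentially best-effort deletion. A hedged sketch of the pattern using the Hadoop FileSystem API (the object and method names here are illustrative, not the actual NodeIdCache code):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.Logging

object CheckpointCleanup extends Logging {
  // A user with write-but-not-delete permissions should not see the job fail
  // just because an old checkpoint directory could not be removed.
  def removeQuietly(checkpointDir: String, hadoopConf: Configuration): Unit = {
    try {
      val path = new Path(checkpointDir)
      val fs = path.getFileSystem(hadoopConf)
      fs.delete(path, true) // recursive delete
    } catch {
      case e: Exception =>
        logWarning(s"Could not delete checkpoint file $checkpointDir: ${e.getMessage}")
    }
  }
}
{code}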
[jira] [Commented] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295663#comment-14295663 ] Yin Huai commented on SPARK-4768: - [~taiji] Seems the attached parquet file only has the last row ('test row 4', null). Can you upload the correct file? Also, can you add a row with a nanosecond precision timestamp value? > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Priority: Critical > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. > Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5361) Multiple Java RDD <-> Python RDD conversions not working correctly
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5361: -- Target Version/s: 1.2.2 Fix Version/s: 1.3.0 Labels: backport-needed (was: ) This has been fixed by https://github.com/apache/spark/pull/4146 in 1.3.0. I'd also like to backport this to {{branch-1.2}}, but I'm not doing that right away since we're voting on a 1.2.1 RC right now. I've added the {{backport-needed}} label and I'll merge this to {{branch-1.2}} as soon as 1.2.1 is released. > Multiple Java RDD <-> Python RDD conversions not working correctly > -- > > Key: SPARK-5361 > URL: https://issues.apache.org/jira/browse/SPARK-5361 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Winston Chen > Labels: backport-needed > Fix For: 1.3.0 > > > This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it > back using `rdd.saveAsNewAPIHadoopFile` in pyspark. > It turns out that whenever there are multiple RDD conversions from JavaRDD to > PythonRDD then back to JavaRDD, the exception below happens: > {noformat} > 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > java.util.ArrayList > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > {noformat} > The test case code below reproduces it: > {noformat} > from pyspark.rdd import RDD > dl = [ > (u'2', {u'director': u'David Lean'}), > (u'7', {u'director': u'Andrew Dominik'}) > ] > dl_rdd = sc.parallelize(dl) > tmp = dl_rdd._to_java_object_rdd() > tmp2 = sc._jvm.SerDe.javaToPython(tmp) > t = RDD(tmp2, sc) > t.count() > tmp = t._to_java_object_rdd() > tmp2 = sc._jvm.SerDe.javaToPython(tmp) > t = RDD(tmp2, sc) > t.count() # it blows up here during the 2nd time of conversion > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5291. --- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta Fixed by https://github.com/apache/spark/pull/4082. > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > SparkListenerExecutorAdded and SparkListenerExecutorRemoved were added recently. > I think it would be useful if they carried a timestamp and the reason why an > executor was removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295623#comment-14295623 ] Alex Baretta commented on SPARK-5236: - [~lian cheng][~imranr] Thanks for commenting and for taking interest in this issue. I definitely wish to help fix this, so that I don't run into this again. I'll try to reproduce this with a stock Spark checkout from master. > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > - > > Key: SPARK-5236 > URL: https://issues.apache.org/jira/browse/SPARK-5236 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > {code} > 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 > (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) > at > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) > at > org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375) > at > 
org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434) > at > parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237) > at > parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353) > at > parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402) > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) > ... 27 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org