[jira] [Created] (SPARK-5476) SQLContext.createDataFrame shouldn't be an implicit function

2015-01-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5476:
--

 Summary: SQLContext.createDataFrame shouldn't be an implicit 
function
 Key: SPARK-5476
 URL: https://issues.apache.org/jira/browse/SPARK-5476
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


It is sort of strange to ask users to import sqlContext._ or 
sqlContext.createDataFrame.

The proposal here is to ask users to define an implicit val for SQLContext, and 
then the dsl package object should include an implicit function that converts an 
RDD[Product] to a DataFrame.
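
For illustration, a rough sketch of what this could look like (the object name and signature are placeholders, not the final API):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical dsl package object: an implicit conversion that picks up the
// SQLContext from implicit scope instead of requiring sqlContext._ imports.
object dsl {
  implicit def rddToDataFrame[A <: Product](rdd: RDD[A])
      (implicit tt: TypeTag[A], sqlContext: SQLContext): DataFrame =
    sqlContext.createDataFrame(rdd)
}

// Usage sketch: the caller declares `implicit val sqlContext = new SQLContext(sc)`
// and `import dsl._`, after which an RDD of case classes converts automatically.
```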







[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix

2015-01-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3977:
-
Assignee: Burak Yavuz

> Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
> ---
>
> Key: SPARK-3977
> URL: https://issues.apache.org/jira/browse/SPARK-3977
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>Assignee: Burak Yavuz
> Fix For: 1.3.0
>
>
> Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix






[jira] [Resolved] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix

2015-01-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3977.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4256
[https://github.com/apache/spark/pull/4256]

> Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
> ---
>
> Key: SPARK-3977
> URL: https://issues.apache.org/jira/browse/SPARK-3977
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Fix For: 1.3.0
>
>
> Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix






[jira] [Resolved] (SPARK-2476) Have sbt-assembly include runtime dependencies in jar

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2476.

Resolution: Not a Problem

[~srowen] Nope, I think we found a workaround.

> Have sbt-assembly include runtime dependencies in jar
> -
>
> Key: SPARK-2476
> URL: https://issues.apache.org/jira/browse/SPARK-2476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
>Priority: Minor
>
> If possible, we should try to contribute the ability to include 
> runtime-scoped dependencies in the assembly jar created with sbt-assembly.
> Currently it only reads compile-scoped dependencies:
> https://github.com/sbt/sbt-assembly/blob/master/src/main/scala/sbtassembly/Plugin.scala#L495
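
For anyone looking for a stopgap, a sketch of a build.sbt override that may achieve this without changing sbt-assembly, assuming the plugin version in use exposes the fullClasspath in assembly key:

```scala
// build.sbt (sbt 0.13 syntax): feed the Runtime classpath, which includes
// runtime-scoped dependencies, to the assembly task instead of the default.
fullClasspath in assembly := (fullClasspath in Runtime).value
```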






[jira] [Resolved] (SPARK-2487) Follow up from SBT build refactor (i.e. SPARK-1776)

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2487.

Resolution: Fixed

> Follow up from SBT build refactor (i.e. SPARK-1776)
> ---
>
> Key: SPARK-2487
> URL: https://issues.apache.org/jira/browse/SPARK-2487
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>
> This is to track follow-up issues relating to SPARK-1776, which was a major 
> refactoring of the SBT build in Spark.






[jira] [Updated] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5466:
---
Component/s: Build

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>Priority: Blocker
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes a build error in multiple components, including GraphX/MLlib/SQL, 
> when the package com.google.common on the classpath is incompatible with the 
> version used when compiling Utils.class:
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4






[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296483#comment-14296483
 ] 

Patrick Wendell commented on SPARK-5466:


Also - [~srowen] can you reproduce this if you do not use Zinc?

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>Priority: Blocker
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes a build error in multiple components, including GraphX/MLlib/SQL, 
> when the package com.google.common on the classpath is incompatible with the 
> version used when compiling Utils.class:
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4






[jira] [Updated] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5466:
---
Priority: Blocker  (was: Major)

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>Priority: Blocker
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes a build error in multiple components, including GraphX/MLlib/SQL, 
> when the package com.google.common on the classpath is incompatible with the 
> version used when compiling Utils.class:
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4






[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296482#comment-14296482
 ] 

Patrick Wendell commented on SPARK-5466:


I sent [~vanzin] an e-mail today about this. Guess I'm not the only one seeing 
it. I was using zinc on OS X... are you guys using that too? I set up a zinc 
Maven build on Jenkins and it worked just fine.
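
One way to check the zinc hypothesis locally (commands are illustrative; adjust to wherever your zinc install lives):

```
# Shut down any long-running zinc compile server, then rebuild with plain Maven
# and compare against a zinc-backed build of the same commit.
zinc -shutdown
mvn -DskipTests clean package
```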

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes a build error in multiple components, including GraphX/MLlib/SQL, 
> when the package com.google.common on the classpath is incompatible with the 
> version used when compiling Utils.class:
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4






[jira] [Comment Edited] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%

2015-01-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296479#comment-14296479
 ] 

Patrick Wendell edited comment on SPARK-4049 at 1/29/15 6:58 AM:
-

[~skrasser] Yes - I agree that behavior is just confusing. One idea would be to 
have a "bit map" so to speak where you can't be 100% unless you have every 
partition cached. And you can never go over 100%.


was (Author: pwendell):
[~skrasser] Yes - I agree that behavior is just confusing. One idea would be to 
have a "bit map" so to speak where you can be 100% unless you have every 
partition cached. And you can never go over 100%.
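
A hypothetical sketch of the "bit map" idea mentioned above (illustrative only, not Spark's actual Web UI code):

```scala
import scala.collection.mutable

// Track which distinct partition indices of an RDD are currently cached, so the
// reported fraction is exact and can never exceed 100%.
class CachedPartitionTracker(numPartitions: Int) {
  private val cached = mutable.BitSet.empty

  def partitionCached(index: Int): Unit = cached += index
  def partitionEvicted(index: Int): Unit = cached -= index

  // Reaches 1.0 only when every partition is cached; duplicate block reports
  // for the same partition cannot push it past 1.0.
  def fractionCached: Double = cached.size.toDouble / numPartitions
}
```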

> Storage web UI "fraction cached" shows as > 100%
> 
>
> Key: SPARK-4049
> URL: https://issues.apache.org/jira/browse/SPARK-4049
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Storage tab of the Spark Web UI, I saw a case where the "Fraction 
> Cached" was greater than 100%:
> !http://i.imgur.com/Gm2hEeL.png!






[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%

2015-01-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296479#comment-14296479
 ] 

Patrick Wendell commented on SPARK-4049:


[~skrasser] Yes - I agree that behavior is just confusing. One idea would be to 
have a "bit map" so to speak where you can't be 100% unless you have every 
partition cached. And you can never go over 100%.

> Storage web UI "fraction cached" shows as > 100%
> 
>
> Key: SPARK-4049
> URL: https://issues.apache.org/jira/browse/SPARK-4049
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Storage tab of the Spark Web UI, I saw a case where the "Fraction 
> Cached" was greater than 100%:
> !http://i.imgur.com/Gm2hEeL.png!






[jira] [Resolved] (SPARK-5471) java.lang.NumberFormatException: For input string:

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5471.

Resolution: Not a Problem

Resolving per your own comment.

>  java.lang.NumberFormatException: For input string: 
> 
>
> Key: SPARK-5471
> URL: https://issues.apache.org/jira/browse/SPARK-5471
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 1.2.0
> Environment: Spark 1.2.0 Maven 
>Reporter: DeepakVohra
>
> Naive Bayes Classifier generates exception with sample_naive_bayes_data.txt
> java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost): java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> 15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool 
> 15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0
> 15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at 
> MLUtils.scala:96, took 1.180869 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): 
> java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply

[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296430#comment-14296430
 ] 

Vladimir Grigor commented on SPARK-5162:


[~lianhuiwang] Thank you for the workaround suggestion! Still, I believe it 
would be great to have a feature for remote script files - that would improve 
the usability of the YARN component in Spark a lot. If you think that is the 
case, and you know the technical details of the system better, could you please 
create a ticket for that feature? Or please comment with any ideas on the 
technical implementation. Thank you!

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  —-py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.






[jira] [Commented] (SPARK-5475) Java 8 tests are like maintenance overhead.

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296405#comment-14296405
 ] 

Apache Spark commented on SPARK-5475:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/4264

> Java 8 tests are like maintenance overhead. 
> 
>
> Key: SPARK-5475
> URL: https://issues.apache.org/jira/browse/SPARK-5475
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>
> Having tests that validate the same code compatible with java 8 and java 7 is 
> like asserting that java 8 is backward compatible with java 7 and still 
> supports java 8 features(lambda expressions to be precise). This was once 
> necessary as asm was not compatible with java 8 and so on. 
> Running java8-tests on the current code base results in more than 100 
> compilation errors, it felt as if they are never run. This is based on the 
> fact that compilation errors have existed for a pretty long period. So IMHO, 
> we should really remove them, if we don't plan to maintain.
> Thoughts?






[jira] [Updated] (SPARK-5475) Java 8 tests are like maintenance overhead.

2015-01-28 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-5475:
---
Issue Type: Bug  (was: Wish)

> Java 8 tests are like maintenance overhead. 
> 
>
> Key: SPARK-5475
> URL: https://issues.apache.org/jira/browse/SPARK-5475
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>
> Having tests that validate the same code compatible with java 8 and java 7 is 
> like asserting that java 8 is backward compatible with java 7 and still 
> supports java 8 features(lambda expressions to be precise). This was once 
> necessary as asm was not compatible with java 8 and so on. 
> Running java8-tests on the current code base results in more than 100 
> compilation errors, it felt as if they are never run. This is based on the 
> fact that compilation errors have existed for a pretty long period. So IMHO, 
> we should really remove them, if we don't plan to maintain.
> Thoughts?






[jira] [Created] (SPARK-5475) Java 8 tests are like maintenance overhead.

2015-01-28 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-5475:
--

 Summary: Java 8 tests are like maintenance overhead. 
 Key: SPARK-5475
 URL: https://issues.apache.org/jira/browse/SPARK-5475
 Project: Spark
  Issue Type: Wish
Reporter: Prashant Sharma


Having tests that validate that the same code is compatible with Java 8 and Java 7 
is like asserting that Java 8 is backward compatible with Java 7 and still 
supports Java 8 features (lambda expressions, to be precise). This was once 
necessary, as ASM was not compatible with Java 8, and so on.

Running java8-tests on the current code base results in more than 100 
compilation errors; it feels as if they are never run. This is based on the fact 
that the compilation errors have existed for a pretty long period. So IMHO, we 
should really remove them if we don't plan to maintain them.

Thoughts?






[jira] [Commented] (SPARK-5474) curl should support URL redirection in build/mvn

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296387#comment-14296387
 ] 

Apache Spark commented on SPARK-5474:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/4263

> curl should support URL redirection in build/mvn
> 
>
> Key: SPARK-5474
> URL: https://issues.apache.org/jira/browse/SPARK-5474
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Guoqiang Li
>
> {{http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz}}
>   sometimes return 3xx






[jira] [Created] (SPARK-5474) curl should support URL redirection in build/mvn

2015-01-28 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-5474:
--

 Summary: curl should support URL redirection in build/mvn
 Key: SPARK-5474
 URL: https://issues.apache.org/jira/browse/SPARK-5474
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Guoqiang Li


{{http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz}}
sometimes returns a 3xx redirect.
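
For reference, curl only follows 3xx redirects when told to; the standard flag is -L / --location, e.g.:

```
curl -L -O http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
```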






[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296375#comment-14296375
 ] 

Lianhui Wang commented on SPARK-5162:
-

The previous problem is that packages.egg is not found. But from your commands, 
I do not see packages.egg in your spark-submit; there are only package1.egg and 
package2.egg. So I think you need to add packages.egg to --py-files and try it. 
Or you can first run in yarn-client mode, and if that is OK, then try to run in 
yarn-cluster mode. That can help us to find some problems.
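
For illustration, the suggested submit command might look roughly like this (paths are placeholders taken from the commands quoted elsewhere in this thread):

```
HADOOP_CONF_DIR=conf/conf.cloudera.yarn ./bin/spark-submit --master yarn-client \
  --py-files /path/to/package1.egg,/path/to/package2.egg,/path/to/packages.egg \
  --num-executors 2 --executor-cores 1 /Users/klassen/Desktop/test.py
```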

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  —-py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.






[jira] [Issue Comment Deleted] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-5162:

Comment: was deleted

(was: The previous problem is that packages.egg is not found. but from your 
commonds, I donnot see packages.egg from your spark-submit and there are only 
package1.egg and package1.egg.so i think you need to add packages.egg to  
--py-files and you can try it. 
Or firstly you can run on yarn-client mode, if it is ok then try to run on yarn 
cluste mode. That can help us to find some problems.)

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  —-py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.






[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296376#comment-14296376
 ] 

Lianhui Wang commented on SPARK-5162:
-

The previous problem is that packages.egg is not found. But from your commands, 
I do not see packages.egg in your spark-submit; there are only package1.egg and 
package2.egg. So I think you need to add packages.egg to --py-files and try it. 
Or you can first run in yarn-client mode, and if that is OK, then try to run in 
yarn-cluster mode. That can help us to find some problems.

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  —-py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.






[jira] [Commented] (SPARK-5473) Expose SSH failures after status checks pass

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296374#comment-14296374
 ] 

Apache Spark commented on SPARK-5473:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/4262

> Expose SSH failures after status checks pass
> 
>
> Key: SPARK-5473
> URL: https://issues.apache.org/jira/browse/SPARK-5473
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>







[jira] [Created] (SPARK-5473) Expose SSH failures after status checks pass

2015-01-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-5473:
---

 Summary: Expose SSH failures after status checks pass
 Key: SPARK-5473
 URL: https://issues.apache.org/jira/browse/SPARK-5473
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor









[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296344#comment-14296344
 ] 

Apache Spark commented on SPARK-5472:
-

User 'tmyklebu' has created a pull request for this issue:
https://github.com/apache/spark/pull/4261

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Priority: Minor
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction---save a DataFrame 
> to a database---for instance in an ETL job.






[jira] [Created] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-28 Thread Tor Myklebust (JIRA)
Tor Myklebust created SPARK-5472:


 Summary: Add support for reading from and writing to a JDBC 
database
 Key: SPARK-5472
 URL: https://issues.apache.org/jira/browse/SPARK-5472
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tor Myklebust
Priority: Minor


It would be nice to be able to make a table in a JDBC database appear as a 
table in Spark SQL.  This would let users, for instance, perform a JOIN between 
a DataFrame in Spark SQL and a table in a Postgres database.

It might also be nice to be able to go the other direction---save a DataFrame 
to a database---for instance in an ETL job.
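
As a sketch of the kind of usage this would enable, written against the DataFrame reader/writer style Spark later settled on (illustrative only, not necessarily the API added by the pull request for this issue):

```scala
import java.util.Properties
import org.apache.spark.sql.SQLContext

def example(sqlContext: SQLContext): Unit = {
  val url = "jdbc:postgresql://dbhost/mydb"  // placeholder connection string

  // Expose a Postgres table as a DataFrame so it can be joined with Spark SQL data.
  val people = sqlContext.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "people")
    .load()

  // Go the other direction: save a DataFrame back to the database, e.g. in an ETL job.
  people.write.jdbc(url, "people_backup", new Properties())
}
```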






[jira] [Commented] (SPARK-5471) java.lang.NumberFormatException: For input string:

2015-01-28 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296307#comment-14296307
 ] 

DeepakVohra commented on SPARK-5471:


Not a bug. The sample data has to be split at the ','.
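
A minimal sketch of what that looks like, assuming an existing SparkContext named sc and lines of the form "label,feature1 feature2 ..." as in sample_naive_bayes_data.txt:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Split each line at the comma first, then split the feature part on spaces,
// so a token like "0,1" is never handed to toDouble.
val parsed = sc.textFile("data/mllib/sample_naive_bayes_data.txt").map { line =>
  val Array(label, features) = line.split(',')
  LabeledPoint(label.toDouble, Vectors.dense(features.trim.split(' ').map(_.toDouble)))
}
```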

>  java.lang.NumberFormatException: For input string: 
> 
>
> Key: SPARK-5471
> URL: https://issues.apache.org/jira/browse/SPARK-5471
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 1.2.0
> Environment: Spark 1.2.0 Maven 
>Reporter: DeepakVohra
>
> Naive Bayes Classifier generates exception with sample_naive_bayes_data.txt
> java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost): java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
>   at 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> 15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool 
> 15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0
> 15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at 
> MLUtils.scala:96, took 1.180869 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): 
> java.lang.NumberFormatException: For input string: "0,1"
>   at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
>   at java.lang.Double.parseDouble(Double.java:540)
>   at 
> scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
>   at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
>   at 
> org.apache.spa

[jira] [Updated] (SPARK-5445) Make sure DataFrame expressions are usable in Java

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5445:
---
Description: Some DataFrame expressions are not exactly usable in Java. For 
example, aggregate functions are only defined in the dsl package object, which 
is painful to use.   (was: Some DataFrame expressions are not exactly usable in 
Java. For example, aggregate functions are only defined in the dsl package 
object, which is painful to use. Another example is operator overloading, which 
would require Java users to use $plus. )

> Make sure DataFrame expressions are usable in Java
> --
>
> Key: SPARK-5445
> URL: https://issues.apache.org/jira/browse/SPARK-5445
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Some DataFrame expressions are not exactly usable in Java. For example, 
> aggregate functions are only defined in the dsl package object, which is 
> painful to use. 






[jira] [Commented] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML pacakge

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296309#comment-14296309
 ] 

Apache Spark commented on SPARK-5253:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4259

> LinearRegression with L1/L2 (elastic net) using OWLQN in new ML pacakge
> ---
>
> Key: SPARK-5253
> URL: https://issues.apache.org/jira/browse/SPARK-5253
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
>







[jira] [Resolved] (SPARK-5445) Make sure DataFrame expressions are usable in Java

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5445.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Make sure DataFrame expressions are usable in Java
> --
>
> Key: SPARK-5445
> URL: https://issues.apache.org/jira/browse/SPARK-5445
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Some DataFrame expressions are not exactly usable in Java. For example, 
> aggregate functions are only defined in the dsl package object, which is 
> painful to use. Another example is operator overloading, which would require 
> Java users to use $plus. 






[jira] [Created] (SPARK-5471) java.lang.NumberFormatException: For input string:

2015-01-28 Thread DeepakVohra (JIRA)
DeepakVohra created SPARK-5471:
--

 Summary:  java.lang.NumberFormatException: For input string: 
 Key: SPARK-5471
 URL: https://issues.apache.org/jira/browse/SPARK-5471
 Project: Spark
  Issue Type: New Feature
Affects Versions: 1.2.0
 Environment: Spark 1.2.0 Maven 
Reporter: DeepakVohra


The Naive Bayes classifier generates an exception with sample_naive_bayes_data.txt:


java.lang.NumberFormatException: For input string: "0,1"
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at 
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/01/28 21:13:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost): java.lang.NumberFormatException: For input string: "0,1"
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at 
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/01/28 21:13:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
aborting job
15/01/28 21:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
15/01/28 21:13:57 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/28 21:13:57 INFO DAGScheduler: Job 0 failed: reduce at MLUtils.scala:96, 
took 1.180869 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For 
input string: "0,1"
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at 
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
at 
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:77)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
at 
org.apache.spark.CacheManager.putInBl

[jira] [Commented] (SPARK-5470) use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296282#comment-14296282
 ] 

Apache Spark commented on SPARK-5470:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4258

> use defaultClassLoader of Serializer to load classes of classesToRegister in 
> KryoSerializer
> ---
>
> Key: SPARK-5470
> URL: https://issues.apache.org/jira/browse/SPARK-5470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Lianhui Wang
>
> Currently KryoSerializer loads the classes listed in classesToRegister at the 
> time of its initialization. When we set spark.kryo.classesToRegister=class1, it 
> throws SparkException("Failed to load class to register with Kryo"), because 
> during KryoSerializer's initialization the classLoader does not yet include the 
> classes from the user's jars.
> We need to use the Serializer's defaultClassLoader in newKryo(), because the 
> executor resets the Serializer's defaultClassLoader after the Serializer's 
> initialization.
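
For context, a minimal sketch of the configuration that hits this failure mode (the class name is a placeholder for something that only exists in a user jar shipped with the application):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // The registered class lives in a user jar, so it is not visible to the class
  // loader that KryoSerializer uses during its own initialization.
  .set("spark.kryo.classesToRegister", "com.example.MyClass")
```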






[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-01-28 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296301#comment-14296301
 ] 

Chris T commented on SPARK-5436:


I may be able to attempt this, but I'm not confident I'll be able to implement 
a solution (or even one of sufficient quality). I only wrote my first line of 
scala about a week ago, so I'm still finding my way around how things work. If 
I can come up with something, I'll share it...

> Validate GradientBoostedTrees during training
> -
>
> Key: SPARK-5436
> URL: https://issues.apache.org/jira/browse/SPARK-5436
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> For Gradient Boosting, it would be valuable to compute test error on a 
> separate validation set during training.  That way, training could stop early 
> based on the test error (or some other metric specified by the user).
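
A hypothetical sketch of the early-stopping loop being described (names and structure are illustrative, not MLlib's implementation):

```scala
// Grow one tree per iteration, evaluate on a held-out validation set after each,
// and keep the best model seen so far; stop as soon as validation error worsens.
def boostWithValidation[M](
    maxIterations: Int,
    addOneTree: () => M,             // fits the next tree and returns the model so far
    validationError: M => Double): M = {
  var best = addOneTree()
  var bestError = validationError(best)
  var i = 1
  var stop = false
  while (i < maxIterations && !stop) {
    val candidate = addOneTree()
    val err = validationError(candidate)
    if (err < bestError) { best = candidate; bestError = err } else stop = true
    i += 1
  }
  best
}
```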






[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Dana Klassen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296303#comment-14296303
 ] 

Dana Klassen commented on SPARK-5162:
-

Yes of course. I tried these combinations:

HADOOP_CONF_DIR=conf/conf.cloudera.yarn  ./bin/spark-submit --master 
yarn-cluster --num-executors 2 --executor-cores 1  
/Users/klassen/Desktop/test.py

HADOOP_CONF_DIR=conf/conf.cloudera.yarn  ./bin/spark-submit --master 
yarn-cluster --py-files '/path/to/package1.egg' --num-executors 2 
--executor-cores 1  /Users/klassen/Desktop/test.py

HADOOP_CONF_DIR=conf/conf.cloudera.yarn  ./bin/spark-submit --master 
yarn-cluster --py-files 
'/path/to/package1.egg,/path/to/package2.egg' --num-executors 2 --executor-cores 
1  /Users/klassen/Desktop/test.py

*The test script in this case makes no use of the resources in the eggs

I forgot to include enough of the logs to show that the packages are uploaded to 
HDFS sparkStaging as follows:

```
15/01/28 21:38:07 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://nn01.chi.shopify.com:8020/user/sparkles/spark-assembly-python-submit.jar
15/01/28 21:38:07 INFO Client: Uploading resource 
file:/Users/klassen/Desktop/test.py -> 
hdfs://nn01.chi.shopify.com:8020/user/klassen/.sparkStaging/application_1422398120127_3034/test.py
```

This is seen for the packages as well. Before these packages are downloaded to 
the container and setup they are cleared from sparkStaging ( seen at the end of 
the previous logs). 

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  --py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296270#comment-14296270
 ] 

Lianhui Wang commented on SPARK-5162:
-

[~dklassen] can you provide syslog of applicationMaster? from you provided 
information,packages.egg donot upload to hdfs at the beginning of 
application.So can you provide the spark-submit command to me? 

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  --py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5470) use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer

2015-01-28 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5470:
---

 Summary: use defaultClassLoader of Serializer to load classes of 
classesToRegister in KryoSerializer
 Key: SPARK-5470
 URL: https://issues.apache.org/jira/browse/SPARK-5470
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang


Currently KryoSerializer loads the classes listed in classesToRegister at the 
time of its initialization. When we set spark.kryo.classesToRegister=class1, it 
throws SparkException("Failed to load class to register with Kryo") because at 
initialization time the classLoader does not yet include the classes from the 
user's jars.
We need to use the Serializer's defaultClassLoader in newKryo() instead, because 
the executor resets the Serializer's defaultClassLoader after the Serializer has 
been initialized.
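
A minimal sketch of the proposed direction, assuming a hypothetical helper inside 
KryoSerializer (the names below are illustrative, not the actual Spark code): 
resolve the registered class names lazily in newKryo(), preferring the 
Serializer's defaultClassLoader when the executor has set one.

{code}
// Sketch only: load the registered class names with the Serializer's
// defaultClassLoader if available, falling back to the current thread's
// context class loader otherwise.
def loadRegisteredClasses(
    classNames: Seq[String],
    defaultClassLoader: Option[ClassLoader]): Seq[Class[_]] = {
  val loader = defaultClassLoader.getOrElse(Thread.currentThread.getContextClassLoader)
  classNames.map { name =>
    try {
      Class.forName(name, false, loader)
    } catch {
      case e: ClassNotFoundException =>
        throw new RuntimeException("Failed to load class to register with Kryo: " + name, e)
    }
  }
}
{code}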



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-28 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296232#comment-14296232
 ] 

Ye Xianjin commented on SPARK-4631:
---

Hi [~dragos], I have the same issue here. I'd like to copy the email I sent to 
Sean here, which may help. 

{quote}
Hi Sean:

I enabled the debug flag in log4j. I believe the MQTTStreamSuite failure is 
more likely due to some weird network issue, but I cannot understand why this 
exception is thrown.

What I saw in unit-tests.log is below:
15/01/28 23:41:37.390 ActiveMQ Transport: tcp:///127.0.0.1:53845@23456 DEBUG 
Transport: Transport Connection to: tcp://127.0.0.1:53845 failed: 
java.net.ProtocolException: Invalid CONNECT encoding
java.net.ProtocolException: Invalid CONNECT encoding
at org.fusesource.mqtt.codec.CONNECT.decode(CONNECT.java:77)
at 
org.apache.activemq.transport.mqtt.MQTTProtocolConverter.onMQTTCommand(MQTTProtocolConverter.java:118)
at 
org.apache.activemq.transport.mqtt.MQTTTransportFilter.onCommand(MQTTTransportFilter.java:74)
at 
org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
at 
org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:222)
at 
org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:204)
at java.lang.Thread.run(Thread.java:695)

However, when I looked at the code at 
http://grepcode.com/file/repo1.maven.org/maven2/org.fusesource.mqtt-client/mqtt-client/1.3/org/fusesource/mqtt/codec/CONNECT.java#76
 , I don't quite understand why that would happen.
I am not familiar with ActiveMQ; maybe you can look at this and figure out what 
really happened.
{quote}

From a quick look at the paho.mqtt-client code, a possible cause for that failure 
is that org.eclipse.paho.mqtt-client may not write PROTOCOL_NAME into the MQTT 
frame. But that doesn't quite make sense either, since the Jenkins builds run this 
test successfully, so I am not sure.

> Add real unit test for MQTT 
> 
>
> Key: SPARK-4631
> URL: https://issues.apache.org/jira/browse/SPARK-4631
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> A real unit test that actually transfers data to ensure that the MQTTUtil is 
> functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)

2015-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-5346.
-
Resolution: Not a Problem

I verified that filter push-down actually is enabled even if we set 
{{parquet.task.side.metadata}} to {{true}}.

The actual filtering happens when {{ParquetRecordReader.initialize()}} is 
called in {{NewHadoopRDD.compute}}. See 
[here|https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L135]
 and 
[here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java#L157-L158].

As for Spark task input size, it seems that Hadoop {{FileSystem}} adds the size 
of a block to the metrics even if we only touch a fraction of it (reading 
Parquet metadata for example).  This behaviour can be verified by the following 
snippet:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sc._
import sqlContext._

case class KeyValue(key: Int, value: String)

parallelize(1 to 1024 * 1024 * 20).
  flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

hadoopConfiguration.set("parquet.task.side.metadata", "true")
sql("SET spark.sql.parquet.filterPushdown=true")

parquetFile("large.parquet").where('key === 
0).queryExecution.toRdd.mapPartitions { _ =>
  new Iterator[Row] {
def hasNext = false
def next() = ???
  }
}.collect()
{code}

Apparently we're reading nothing here (except for the Parquet metadata in the 
footers), but the web UI still suggests that the input size of all tasks equals 
the file size.  In addition, we may find log lines written by 
{{ParquetRecordReader}} like this:
{code}
...
15/01/28 16:50:56 INFO FilterCompat: Filtering using predicate: eq(key, 0)
15/01/28 16:50:56 INFO InternalParquetRecordReader: RecordReader initialized 
will read a total of 0 records.
...
{code}
which suggests row group filtering does work as expected.

So I'll just close this ticket.

> Parquet filter pushdown is not enabled when parquet.task.side.metadata is set 
> to true (default value)
> -
>
> Key: SPARK-5346
> URL: https://issues.apache.org/jira/browse/SPARK-5346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> When computing Parquet splits, reading Parquet metadata from executor side is 
> more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to 
> {{true}} by 
> default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437].
>  However, somehow this disables filter pushdown. 
> To workaround this issue and enable Parquet filter pushdown, users can set 
> {{spark.sql.parquet.filterPushdown}} to {{true}} and 
> {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files 
> with a large number of part-files and/or columns, reading metadata from 
> driver side eats lots of memory.
> The following Spark shell snippet can be useful to reproduce this issue:
> {code}
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> case class KeyValue(key: Int, value: String)
> sc.
>   parallelize(1 to 1024).
>   flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
>   saveAsParquetFile("large.parquet")
> parquetFile("large.parquet").registerTempTable("large")
> sql("SET spark.sql.parquet.filterPushdown=true")
> sql("SELECT * FROM large").collect()
> sql("SELECT * FROM large WHERE key < 200").collect()
> {code}
> Users can verify this issue by checking the input size metrics from web UI. 
> When filter pushdown is enabled, the second query reads fewer data.
> Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ 
> configuration (either via {{core-site.xml}} or 
> {{SparkConf.hadoopConfiguration.set()}}); setting it in 
> {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5430) Move treeReduce and treeAggregate to core

2015-01-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5430.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4228
[https://github.com/apache/spark/pull/4228]

> Move treeReduce and treeAggregate to core
> -
>
> Key: SPARK-5430
> URL: https://issues.apache.org/jira/browse/SPARK-5430
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.3.0
>
>
> I've seen many use cases of treeAggregate/treeReduce outside the ML domain. 
> Maybe it is time to move them to Core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4586) Python API for ML Pipeline

2015-01-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4586.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4151
[https://github.com/apache/spark/pull/4151]

> Python API for ML Pipeline
> --
>
> Key: SPARK-4586
> URL: https://issues.apache.org/jira/browse/SPARK-4586
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.3.0
>
>
> Add Python API to the newly added ML pipeline and parameters. The initial 
> design doc is posted here: 
> https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%

2015-01-28 Thread Sven Krasser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296037#comment-14296037
 ] 

Sven Krasser edited comment on SPARK-4049 at 1/29/15 12:07 AM:
---

-I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). 
I assume that means that most partitions are replicated twice and some three 
times?- EDIT: looks like for 2x it will ramp up to 200%, which is consistent 
with Patrick's comment.

That aside, I do not think it's a good idea to count overreplication towards 
that fraction in this way. As a user, when I see 100% on the UI, then I assume 
the RDD is fully cached. However, this could also mean that some partitions are 
missing (and need to be recomputed) and some are overreplicated.


was (Author: skrasser):
I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). 
I assume that means that most partitions are replicated twice and some three 
times?

That aside, I do not think it's a good idea to count overreplication towards 
that fraction in this way. As a user, when I see 100% on the UI, then I assume 
the RDD is fully cached. However, this could also mean that some partitions are 
missing (and need to be recomputed) and some are overreplicated.

> Storage web UI "fraction cached" shows as > 100%
> 
>
> Key: SPARK-4049
> URL: https://issues.apache.org/jira/browse/SPARK-4049
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Storage tab of the Spark Web UI, I saw a case where the "Fraction 
> Cached" was greater than 100%:
> !http://i.imgur.com/Gm2hEeL.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5420:
---
Priority: Blocker  (was: Major)

> Cross-langauge load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>Priority: Blocker
>
> We should have standard API's for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5420:
---
Description: 
We should have standard API's for loading or saving a table from a data store. 
Per comment discussion:

{code}
def loadData(datasource: String, parameters: Map[String, String]): DataFrame
def loadData(datasource: String, parameters: java.util.Map[String, String]): 
DataFrame
def storeData(datasource: String, parameters: Map[String, String]): DataFrame
def storeData(datasource: String, parameters: java.util.Map[String, String]): 
DataFrame
{code}

Python should have this too.

  was:
We should have standard API's for loading or saving a table from a data store. 
Per comment discussion:

{code}
df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"})
{code}


> Cross-langauge load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>
> We should have standard API's for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5420:
---
Description: 
We should have standard API's for loading or saving a table from a data store. 
Per comment discussion:

{code}
df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"})
{code}

  was:
We should have standard API's for loading or saving a table from a data store. 
One idea:

{code}
df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"})
{code}


> Cross-langauge load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>
> We should have standard API's for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
> sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5420) Cross-langauge load/store functions for creating and saving DataFrames

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5420:
---
Assignee: Yin Huai

> Cross-langauge load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>
> We should have standard API's for loading or saving a table from a data 
> store. One idea:
> {code}
> df = sc.loadTable("path.to.DataSource", {"a": "b", "c": "d"})
> sc.storeTable("path.to.DataSouce", {"a":"b", "c":"d"})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5247) Enable javadoc/scaladoc for public classes in catalyst project

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5247:
---
Priority: Blocker  (was: Major)

> Enable javadoc/scaladoc for public classes in catalyst project
> --
>
> Key: SPARK-5247
> URL: https://issues.apache.org/jira/browse/SPARK-5247
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>Priority: Blocker
>
> We previously did not generate any docs for the entire catalyst project. 
> Since now we are defining public APIs in that (under org.apache.spark.sql 
> outside of org.apache.spark.sql.catalyst, such as Row, types._), we should 
> start generating javadoc/scaladoc for those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296085#comment-14296085
 ] 

Apache Spark commented on SPARK-3977:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4256

> Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
> ---
>
> Key: SPARK-3977
> URL: https://issues.apache.org/jira/browse/SPARK-3977
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>
> Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5469) Break sql.py into multiple files

2015-01-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5469:
--

 Summary: Break sql.py into multiple files
 Key: SPARK-5469
 URL: https://issues.apache.org/jira/browse/SPARK-5469
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


It is getting pretty long (2800 loc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5468) Remove Python LocalHiveContext

2015-01-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5468:
--

 Summary: Remove Python LocalHiveContext
 Key: SPARK-5468
 URL: https://issues.apache.org/jira/browse/SPARK-5468
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway

2015-01-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296068#comment-14296068
 ] 

Marcelo Vanzin commented on SPARK-5388:
---

Hi [~andrewor14],

I read through the spec, and the protocol specification seems to be lacking some 
details. The main things that bother me are:

- It's not really a REST API. There's a single endpoint to which you POST 
different messages. This sort of forces your hand to use a custom 
implementation, instead of being able to use a much nicer framework for this 
purpose such as JAX-RS. Using a framework like that can later benefit other 
parts of Spark too, such as providing a REST API for application data through 
the web ui / history server. And as I mentioned in the PR, it allows you to 
define the endpoints using classes or interfaces, which serves two purposes: it 
allows you to do backwards compatibility checks with tools like MIMA, and it 
allows you to use the client functionality of JAX-RS for client requests too 
(and similar tools for other languages, which sort of feeds back into Dale's 
comment). Plus, you can use things like Jackson and not care about how to parse 
or generate JSON.

- It's unclear how the protocol will be allowed to evolve. What happens when 
you add a new field or message in a later version, and that version tries to 
submit to Spark 1.3? Is there a version negotiation up front, so that the new 
client knows to use the old protocol if possible, or does the client just send 
the new message and the server will complain if it contains things it doesn't 
understand?

The latter kinda feeds into the first comment. With a proper REST-based API, 
you'd put the first version of the protocol under "/v1", for example. Later 
versions are added under "/v2" and can add new things. Client and server can 
then negotiate up front (e.g., the client needs at least version "x" for the current 
app, asks the server for its supported versions, and complains if "x" is not 
there).

Also, it could be more specific about how errors are reported. Do you get 
specific HTTP error codes for different things? Is there an "Error" type that 
is sent back to the client in JSON, and if so, what fields does it have?
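
For illustration, a hedged sketch of what a versioned JAX-RS endpoint along these 
lines could look like; the resource and payload classes below are made up for this 
example, and JSON binding is assumed to be handled by a provider such as Jackson 
(with its Scala module).

{code}
import javax.ws.rs._
import javax.ws.rs.core.MediaType

// Illustrative request/response payloads; not part of any Spark API.
case class SubmitRequest(appResource: String, mainClass: String, appArgs: Array[String])
case class SubmitResponse(submissionId: String, success: Boolean)

// Hypothetical v1 endpoint: everything lives under "/v1", so a future "/v2"
// can change messages without breaking older clients.
@Path("/v1/submissions")
@Consumes(Array(MediaType.APPLICATION_JSON))
@Produces(Array(MediaType.APPLICATION_JSON))
class SubmissionResourceV1 {

  @POST
  def submit(request: SubmitRequest): SubmitResponse = {
    // Hand the request off to the cluster manager; stubbed out in this sketch.
    SubmitResponse(submissionId = "driver-0001", success = true)
  }

  @GET
  @Path("/{id}/status")
  def status(@PathParam("id") id: String): String =
    s"""{"submissionId": "$id", "state": "RUNNING"}"""
}
{code}

With this layout, version negotiation reduces to the client asking which "/vN" 
prefixes the server exposes and picking the newest one it understands.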


> Provide a stable application submission gateway
> ---
>
> Key: SPARK-5388
> URL: https://issues.apache.org/jira/browse/SPARK-5388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Stable Spark Standalone Submission.pdf
>
>
> The existing submission gateway in standalone mode is not compatible across 
> Spark versions. If you have a newer version of Spark submitting to an older 
> version of the standalone Master, it is currently not guaranteed to work. The 
> goal is to provide a stable REST interface to replace this channel.
> The first cut implementation will target standalone cluster mode because 
> there are very few messages exchanged. The design, however, will be general 
> enough to eventually support this for other cluster managers too. Note that 
> this is not necessarily required in YARN because we already use YARN's stable 
> interface to submit applications there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5445) Make sure DataFrame expressions are usable in Java

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296064#comment-14296064
 ] 

Apache Spark commented on SPARK-5445:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4241

> Make sure DataFrame expressions are usable in Java
> --
>
> Key: SPARK-5445
> URL: https://issues.apache.org/jira/browse/SPARK-5445
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Some DataFrame expressions are not exactly usable in Java. For example, 
> aggregate functions are only defined in the dsl package object, which is 
> painful to use. Another example is operator overloading, which would require 
> Java users to use $plus. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5448) Make CacheManager a concrete class and field in SQLContext

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5448.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Make CacheManager a concrete class and field in SQLContext
> --
>
> Key: SPARK-5448
> URL: https://issues.apache.org/jira/browse/SPARK-5448
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> So we don't have to include it using trait mixin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5447) Replace reference to SchemaRDD with DataFrame

2015-01-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5447.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Replace reference to SchemaRDD with DataFrame
> -
>
> Key: SPARK-5447
> URL: https://issues.apache.org/jira/browse/SPARK-5447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> We renamed SchemaRDD -> DataFrame, but internally various code still 
> reference SchemaRDD in JavaDoc and comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)

2015-01-28 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-5467.
-
Resolution: Duplicate

> DStreams should provide windowing based on timestamps from the data (as 
> opposed to wall clock time)
> ---
>
> Key: SPARK-5467
> URL: https://issues.apache.org/jira/browse/SPARK-5467
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Imran Rashid
>
> DStreams currently only let you window based on wall clock time.  This 
> doesn't work very well when you're loading historical logs that are already 
> sitting around, because they'll all go into one window.  DStreams should 
> provide a way for you to window based on a field of the incoming data.  This 
> would be useful if you want to either (1) bootstrap a streaming app from some 
> logs or (2) test out the behavior of your app on historical logs, eg. for 
> correctness or performance.
> I think there are some open questions here, such as whether the input data 
> sources need to be sorted by time, how batches get triggered etc., but it 
> seems like an important use case.
> This just came up on the mailing list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html
> And I think it is also what this Jira was getting at: 
> https://issues.apache.org/jira/browse/SPARK-4427



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)

2015-01-28 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296047#comment-14296047
 ] 

Imran Rashid commented on SPARK-5467:
-

Shoot, sorry I missed that JIRA.  (I swear I tried searching; no idea how I 
missed it.)

It actually seems to be proposing something significantly more involved, by 
keeping all the bins open to receive more events, but I suppose it's close 
enough to a duplicate.  I'll close this.

> DStreams should provide windowing based on timestamps from the data (as 
> opposed to wall clock time)
> ---
>
> Key: SPARK-5467
> URL: https://issues.apache.org/jira/browse/SPARK-5467
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Imran Rashid
>
> DStreams currently only let you window based on wall clock time.  This 
> doesn't work very well when you're loading historical logs that are already 
> sitting around, because they'll all go into one window.  DStreams should 
> provide a way for you to window based on a field of the incoming data.  This 
> would be useful if you want to either (1) bootstrap a streaming app from some 
> logs or (2) test out the behavior of your app on historical logs, eg. for 
> correctness or performance.
> I think there are some open questions here, such as whether the input data 
> sources need to be sorted by time, how batches get triggered etc., but it 
> seems like an important use case.
> This just came up on the mailing list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html
> And I think it is also what this Jira was getting at: 
> https://issues.apache.org/jira/browse/SPARK-4427



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%

2015-01-28 Thread Sven Krasser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296037#comment-14296037
 ] 

Sven Krasser commented on SPARK-4049:
-

I'm also seeing this for a 2x replicated RDD (StorageLevel.MEMORY_AND_DISK_2). 
I assume that means that most partitions are replicated twice and some three 
times?

That aside, I do not think it's a good idea to count overreplication towards 
that fraction in this way. As a user, when I see 100% on the UI, then I assume 
the RDD is fully cached. However, this could also mean that some partitions are 
missing (and need to be recomputed) and some are overreplicated.

> Storage web UI "fraction cached" shows as > 100%
> 
>
> Key: SPARK-4049
> URL: https://issues.apache.org/jira/browse/SPARK-4049
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Storage tab of the Spark Web UI, I saw a case where the "Fraction 
> Cached" was greater than 100%:
> !http://i.imgur.com/Gm2hEeL.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)

2015-01-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296032#comment-14296032
 ] 

Sean Owen commented on SPARK-5467:
--

Is this about the same as SPARK-4392?

> DStreams should provide windowing based on timestamps from the data (as 
> opposed to wall clock time)
> ---
>
> Key: SPARK-5467
> URL: https://issues.apache.org/jira/browse/SPARK-5467
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Imran Rashid
>
> DStreams currently only let you window based on wall clock time.  This 
> doesn't work very well when you're loading historical logs that are already 
> sitting around, because they'll all go into one window.  DStreams should 
> provide a way for you to window based on a field of the incoming data.  This 
> would be useful if you want to either (1) bootstrap a streaming app from some 
> logs or (2) test out the behavior of your app on historical logs, eg. for 
> correctness or performance.
> I think there are some open questions here, such as whether the input data 
> sources need to be sorted by time, how batches get triggered etc., but it 
> seems like an important use case.
> This just came up on the mailing list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html
> And I think it is also what this Jira was getting at: 
> https://issues.apache.org/jira/browse/SPARK-4427



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296031#comment-14296031
 ] 

Sean Owen commented on SPARK-5466:
--

I see this too from a completely clean build.

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes build errors in multiple components, including Graph/MLLib/SQL, 
> when the package com.google.common on the classpath is incompatible with the 
> version used when compiling Utils.class:
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)

2015-01-28 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-5467:
---

 Summary: DStreams should provide windowing based on timestamps 
from the data (as opposed to wall clock time)
 Key: SPARK-5467
 URL: https://issues.apache.org/jira/browse/SPARK-5467
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Imran Rashid


DStreams currently only let you window based on wall clock time.  This doesn't 
work very well when you're loading historical logs that are already sitting 
around, because they'll all go into one window.  DStreams should provide a way 
for you to window based on a field of the incoming data.  This would be useful 
if you want to either (1) bootstrap a streaming app from some logs or (2) test 
out the behavior of your app on historical logs, eg. for correctness or 
performance.

I think there are some open questions here, such as whether the input data 
sources need to be sorted by time, how batches get triggered etc., but it seems 
like an important use case.

This just came up on the mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html

And I think it is also what this Jira was getting at: 
https://issues.apache.org/jira/browse/SPARK-4427
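
As a rough illustration of the idea (not a proposed API), records can already be 
bucketed on a timestamp field they carry, so that windows are defined by the data 
rather than by the wall clock; LogEvent and the 60-second bucket width below are 
assumptions made up for this sketch.

{code}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

// A record that carries its own event time, e.g. parsed from a log line.
case class LogEvent(timestampMs: Long, line: String)

// Assign each event to a fixed-width bucket derived from its embedded
// timestamp, then group per bucket; the "window" an event falls into is thus
// determined by the log timestamp, not by when the batch was received.
def bucketByEventTime(events: DStream[LogEvent],
                      bucketMs: Long = 60 * 1000L): DStream[(Long, Iterable[LogEvent])] = {
  events
    .map(e => (e.timestampMs / bucketMs, e))
    .groupByKey()
}
{code}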



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-28 Thread Jian Zhou (JIRA)
Jian Zhou created SPARK-5466:


 Summary: Build Error caused by Guava shading in Spark
 Key: SPARK-5466
 URL: https://issues.apache.org/jira/browse/SPARK-5466
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Jian Zhou


Guava is shaded inside spark-core itself.

https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39

This causes build errors in multiple components, including Graph/MLLib/SQL, when 
the package com.google.common on the classpath is incompatible with the version 
used when compiling Utils.class:

[error] bad symbolic reference. A signature in Utils.class refers to term util
[error] in package com.google.common which is not available.
[error] It may be completely missing from the current classpath, or the version 
on
[error] the classpath might be incompatible with the version used when 
compiling Utils.class.
[error] 
[error]  while compiling: 
/spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
[error] during phase: erasure
[error]  library version: version 2.10.4
[error] compiler version: version 2.10.4




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5465) Data source version of Parquet doesn't push down And filters properly

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295951#comment-14295951
 ] 

Apache Spark commented on SPARK-5465:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4255

> Data source version of Parquet doesn't push down And filters properly
> -
>
> Key: SPARK-5465
> URL: https://issues.apache.org/jira/browse/SPARK-5465
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cheng Lian
>Priority: Blocker
>
> The current implementation combines all predicates and then tries to convert 
> it to a single Parquet filter predicate. In this way, the Parquet filter 
> predicate can not be generated if any component of the original filters can 
> not be converted. (code lines 
> [here|https://github.com/apache/spark/blob/a731314c319a6f265060e05267844069027804fd/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L197-L201]).
> For example, {{a > 10 AND a < 20}} can be successfully converted, while {{a > 
> 10 AND a < b}} can't because Parquet doesn't accept filters like {{a < b}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5441) SerDeUtil Pair RDD to python conversion doesn't accept empty RDDs

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5441:
--
Target Version/s: 1.3.0, 1.1.2, 1.2.2  (was: 1.3.0)
   Fix Version/s: 1.3.0
  Labels: backport-needed  (was: )

I've merged https://github.com/apache/spark/pull/4236, which fixes this, into 
1.3.0, and I'll come back later to backport it to 1.2.2 and 1.1.2.

> SerDeUtil Pair RDD to python conversion doesn't accept empty RDDs
> -
>
> Key: SPARK-5441
> URL: https://issues.apache.org/jira/browse/SPARK-5441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Michael Nazario
>Assignee: Michael Nazario
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> SerDeUtil.pairRDDToPython and SerDeUtil.pythonToPairRDD rely on rdd.first() 
> which throws an exception if the RDD is empty. We should be able to handle 
> the empty RDD case because this doesn't prevent a valid RDD from being 
> created.
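
A tiny hedged sketch of the kind of guard described above; firstOption is an 
illustrative helper, not the actual change in the pull request.

{code}
import org.apache.spark.rdd.RDD

// Probe with take(1) instead of calling first(), which throws on an empty RDD;
// callers can then fall back to a default instead of failing on empty input.
def firstOption[T](rdd: RDD[T]): Option[T] = rdd.take(1).headOption
{code}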



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-28 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295930#comment-14295930
 ] 

Andrew Musselman commented on SPARK-4259:
-

So this feature won't be doing spectral clustering, and will be switching to 
the power iteration method?

Should I create another ticket for spectral clustering if so?

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power Iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values.  Internally the 
> algorithm:
> computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> calculates a Normalized Affinity Matrix
> calculates the principal eigenvalue and eigenvector
> Clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found within [Power Iteration Clustering, Lin 
> and Cohen]{www.icml2010.org/papers/387.pdf}
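
For illustration only, a small self-contained Scala sketch of the steps quoted 
above (Gaussian affinities, row normalization, power iteration, and a simple 
split on the resulting vector); the data, sigma, iteration count, and 
thresholding are arbitrary choices for this example and are not part of the 
proposal.

{code}
// Toy data: two well-separated groups of two points each.
val points = Array(Array(0.0, 0.0), Array(0.1, 0.2), Array(5.0, 5.1), Array(5.2, 4.9))
val sigma = 1.0
val n = points.length

def gaussian(a: Array[Double], b: Array[Double]): Double = {
  val d2 = a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  math.exp(-d2 / (2 * sigma * sigma))
}

// Affinity matrix A (zero diagonal) and its row-normalized form W = D^-1 A.
val affinity = Array.tabulate(n, n)((i, j) => if (i == j) 0.0 else gaussian(points(i), points(j)))
val rowSums = affinity.map(_.sum)
val w = Array.tabulate(n, n)((i, j) => affinity(i)(j) / rowSums(i))

// Power iteration: start from the degree-normalized vector (the uniform vector
// is already a fixed point of W) and stop after a few steps; the cluster
// structure shows up in this intermediate, not fully converged, vector.
val degreeSum = rowSums.sum
var v = rowSums.map(_ / degreeSum)
for (_ <- 1 to 10) {
  val next = Array.tabulate(n)(i => (0 until n).map(j => w(i)(j) * v(j)).sum)
  val l1 = next.map(math.abs).sum
  v = next.map(_ / l1)
}

// Cluster by the component value (simple threshold at the mean for 2 clusters).
val mean = v.sum / n
val clusters = v.map(x => if (x >= mean) 1 else 0)
{code}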



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4387) Refactoring python profiling code to make it extensible

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4387:
--
Assignee: Yandu Oppacher

> Refactoring python profiling code to make it extensible
> ---
>
> Key: SPARK-4387
> URL: https://issues.apache.org/jira/browse/SPARK-4387
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Yandu Oppacher
>Assignee: Yandu Oppacher
> Fix For: 1.3.0
>
>
> SPARK-3478 introduced python profiling for workers which is great but it 
> would be nice to be able to change the profiler and output formats as needed. 
> This is a refactoring of the code to allow that to happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4387) Refactoring python profiling code to make it extensible

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4387.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3901
[https://github.com/apache/spark/pull/3901]

> Refactoring python profiling code to make it extensible
> ---
>
> Key: SPARK-4387
> URL: https://issues.apache.org/jira/browse/SPARK-4387
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Yandu Oppacher
>Assignee: Yandu Oppacher
> Fix For: 1.3.0
>
>
> SPARK-3478 introduced python profiling for workers which is great but it 
> would be nice to be able to change the profiler and output formats as needed. 
> This is a refactoring of the code to allow that to happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295912#comment-14295912
 ] 

Apache Spark commented on SPARK-4259:
-

User 'fjiang6' has created a pull request for this issue:
https://github.com/apache/spark/pull/4254

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power Iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values.  Internally the 
> algorithm:
> computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> calculates a Normalized Affinity Matrix
> calculates the principal eigenvalue and eigenvector
> Clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found within [Power Iteration Clustering, Lin 
> and Cohen]{www.icml2010.org/papers/387.pdf}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5465) Data source version of Parquet doesn't push down And filters properly

2015-01-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5465:
-

 Summary: Data source version of Parquet doesn't push down And 
filters properly
 Key: SPARK-5465
 URL: https://issues.apache.org/jira/browse/SPARK-5465
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cheng Lian
Priority: Blocker


The current implementation combines all predicates and then tries to convert it 
to a single Parquet filter predicate. In this way, the Parquet filter predicate 
can not be generated if any component of the original filters can not be 
converted. (code lines 
[here|https://github.com/apache/spark/blob/a731314c319a6f265060e05267844069027804fd/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L197-L201]).

For example, {{a > 10 AND a < 20}} can be successfully converted, while {{a > 
10 AND a < b}} can't because Parquet doesn't accept filters like {{a < b}}.
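
By way of illustration, a hedged sketch of per-predicate conversion (the 
toParquetFilter parameter is a stand-in for the real conversion logic, not an 
existing function): convert each source filter on its own, push down only the 
convertible ones, and let Spark evaluate the rest after the scan.

{code}
import org.apache.spark.sql.sources.Filter
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Sketch only: given the data source filters and a conversion function for a
// single predicate, build one Parquet predicate out of the convertible subset.
def pushDownConvertible(
    filters: Seq[Filter],
    toParquetFilter: Filter => Option[FilterPredicate]): Option[FilterPredicate] = {
  filters
    .flatMap(f => toParquetFilter(f))      // drop predicates Parquet can't express
    .reduceOption(FilterApi.and(_, _))     // AND together whatever remains
}
{code}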



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-28 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, power Iteration clustering has become one of the most popular 
modern clustering algorithms. It is simple to implement, can be solved 
efficiently by standard linear algebra software, and very often outperforms 
traditional clustering algorithms such as the k-means algorithm.

Power iteration clustering is a scalable and efficient algorithm for clustering 
points given pointwise mutual affinity values.  Internally the algorithm:

computes the Gaussian distance between all pairs of points and represents these 
distances in an Affinity Matrix
calculates a Normalized Affinity Matrix
calculates the principal eigenvalue and eigenvector
Clusters each of the input points according to their principal eigenvector 
component value

Details of this algorithm are found within [Power Iteration Clustering, Lin and 
Cohen]{www.icml2010.org/papers/387.pdf}


  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

Power iteration clustering is a scalable and efficient algorithm for clustering 
points given pointwise mutual affinity values.  Internally the algorithm:

computes the Gaussian distance between all pairs of points and represents these 
distances in an Affinity Matrix
calculates a Normalized Affinity Matrix
calculates the principal eigenvalue and eigenvector
Clusters each of the input points according to their principal eigenvector 
component value

Details of this algorithm are found within [Power Iteration Clustering, Lin and 
Cohen|www.icml2010.org/papers/387.pdf]



> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values.  Internally the 
> algorithm:
> computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> calculates a Normalized Affinity Matrix
> calculates the principal eigenvalue and eigenvector
> Clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found within [Power Iteration Clustering, Lin 
> and Cohen|www.icml2010.org/papers/387.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-28 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

Power iteration clustering is a scalable and efficient algorithm for clustering 
points given pointwise mutual affinity values.  Internally the algorithm:

computes the Gaussian distance between all pairs of points and represents these 
distances in an Affinity Matrix
calculates a Normalized Affinity Matrix
calculates the principal eigenvalue and eigenvector
Clusters each of the input points according to their principal eigenvector 
component value

Details of this algorithm are found within [Power Iteration Clustering, Lin and 
Cohen|www.icml2010.org/papers/387.pdf]


  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix using a Gaussian similarity 
function. A brief design is outlined below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm into 
clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



Summary: Add Power Iteration Clustering Algorithm with Gaussian 
Similarity Function  (was: Add Spectral Clustering Algorithm with Gaussian 
Similarity Function)

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, spectral clustering has become one of the most popular 
> modern clustering algorithms. It is simple to implement, can be solved 
> efficiently by standard linear algebra software, and very often outperforms 
> traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values.  Internally the 
> algorithm:
> computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> calculates a Normalized Affinity Matrix
> calculates the principal eigenvalue and eigenvector
> Clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found within [Power Iteration Clustering, Lin 
> and Cohen|www.icml2010.org/papers/387.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5463) Fix Parquet filter push-down

2015-01-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295902#comment-14295902
 ] 

Cheng Lian commented on SPARK-5463:
---

SPARK-4258 is fixed in Parquet master. SPARK-5451 is fixed by [Parquet PR 
#108|https://github.com/apache/incubator-parquet-mr/pull/108]. Once Parquet PR 
#108 is merged and a new (RC) release is cut, the first 2 sub-tasks can be 
resolved.

> Fix Parquet filter push-down
> 
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2
>Reporter: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)

2015-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5346:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5463

> Parquet filter pushdown is not enabled when parquet.task.side.metadata is set 
> to true (default value)
> -
>
> Key: SPARK-5346
> URL: https://issues.apache.org/jira/browse/SPARK-5346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> When computing Parquet splits, reading Parquet metadata from the executor side 
> is more memory efficient, so Spark SQL [sets {{parquet.task.side.metadata}} to 
> {{true}} by 
> default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437].
>  However, somehow this disables filter pushdown. 
> To work around this issue and enable Parquet filter pushdown, users can set 
> {{spark.sql.parquet.filterPushdown}} to {{true}} and 
> {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files 
> with many part-files and/or columns, reading metadata from the driver side 
> consumes a lot of memory.
> The following Spark shell snippet can be useful to reproduce this issue:
> {code}
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> case class KeyValue(key: Int, value: String)
> sc.
>   parallelize(1 to 1024).
>   flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
>   saveAsParquetFile("large.parquet")
> parquetFile("large.parquet").registerTempTable("large")
> sql("SET spark.sql.parquet.filterPushdown=true")
> sql("SELECT * FROM large").collect()
> sql("SELECT * FROM large WHERE key < 200").collect()
> {code}
> Users can verify this issue by checking the input size metrics in the web UI: 
> when filter pushdown is enabled, the second query reads less data.
> Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ 
> configuration (either via {{core-site.xml}} or 
> {{SparkContext.hadoopConfiguration.set()}}); setting it in 
> {{spark-defaults.conf}} or via {{SparkConf}} does NOT work. A short sketch of 
> the workaround follows below.
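
A minimal spark-shell sketch of the workaround described above (it assumes the 
{{large}} table registered in the snippet earlier in this description):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The Parquet property must go into the Hadoop configuration ...
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")

// ... while the Spark SQL switch is an ordinary SQL setting.
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

// With pushdown active, this query should report a smaller input size in the web UI.
sqlContext.sql("SELECT * FROM large WHERE key < 200").collect()
{code}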



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error

2015-01-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5464:
-

 Summary: Calling help() on a Python DataFrame fails with "cannot 
resolve column name __name__" error
 Key: SPARK-5464
 URL: https://issues.apache.org/jira/browse/SPARK-5464
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Josh Rosen
Priority: Blocker


Trying to call {{help()}} on a Python DataFrame fails with an exception:

{code}
>>> help(df)
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in __call__
return pydoc.help(*args, **kwds)
  File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in 
__call__
self.help(request)
  File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help
else: doc(request, 'Help on %s:')
  File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc
pager(render_doc(thing, title, forceload))
  File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in 
render_doc
object, name = resolve(thing, forceload)
  File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in resolve
name = getattr(thing, '__name__', None)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, in 
__getattr__
return Column(self._jdf.apply(name))
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply.
: java.lang.RuntimeException: Cannot resolve column name "__name__"
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

Here's a reproduction:

{code}
>>> from pyspark.sql import SQLContext, Row
>>> sqlContext = SQLContext(sc)
>>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
>>> df = sqlContext.jsonRDD(rdd)
>>> help(df)
{code}

I think the problem here is that we don't throw the expected exception from our 
overloaded {{__getattr__}} if a column can't be found.

We should be able to fix this by only attempting to call {{apply}} after 
checking that the column name is valid (e.g. check against {{columns}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5451) And predicates are not properly pushed down

2015-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5451:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5463

> And predicates are not properly pushed down
> ---
>
> Key: SPARK-5451
> URL: https://issues.apache.org/jira/browse/SPARK-5451
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cheng Lian
>Priority: Critical
>
> This issue is actually caused by PARQUET-173.
> The following {{spark-shell}} session can be used to reproduce this bug:
> {code}
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sc._
> import sqlContext._
> case class KeyValue(key: Int, value: String)
> parallelize(1 to 1024 * 1024 * 20).
>   flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
>   saveAsParquetFile("large.parquet")
> parquetFile("large.parquet").registerTempTable("large")
> hadoopConfiguration.set("parquet.task.side.metadata", "false")
> sql("SET spark.sql.parquet.filterPushdown=true")
> sql("SELECT value FROM large WHERE 1024 < value AND value < 2048").collect()
> {code}
> From the log we can find:
> {code}
> There were no row groups that could be dropped due to filter predicates
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4258) NPE with new Parquet Filters

2015-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4258:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5463

> NPE with new Parquet Filters
> 
>
> Key: SPARK-4258
> URL: https://issues.apache.org/jira/browse/SPARK-4258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
> java.lang.NullPointerException: 
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> {code}
> This occurs when reading Parquet data encoded with the older version of the 
> library for TPC-DS query 34. Will work on coming up with a smaller 
> reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5463) Fix Parquet filter push-down

2015-01-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5463:
-

 Summary: Fix Parquet filter push-down
 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame

2015-01-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5462:
-

 Summary: Catalyst UnresolvedException "Invalid call to qualifiers 
on unresolved object" error when accessing fields in Python DataFrame
 Key: SPARK-5462
 URL: https://issues.apache.org/jira/browse/SPARK-5462
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Josh Rosen
Priority: Blocker


When trying to access fields on a Python DataFrame created via inferSchema, I 
ran into a confusing Catalyst Py4J error.  Here's a reproduction:

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the SchemaRDD as a table.
schemaPeople = sqlContext.inferSchema(people)
schemaPeople.registerTempTable("people")

# SQL can be run over SchemaRDDs that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 
19")

print teenagers.name
{code}

This fails with the following error:

{code}
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
print teenagers.name
  File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, in 
__getattr__
return Column(self._jdf.apply(name))
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
qualifiers on unresolved object, tree: 'name
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

This is distinct from the helpful error message that I get when trying to 
access a non-existent column.  This error didn't occur when I tried the same 
thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4989) wrong application configuration cause cluster down in standalone mode

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4989:
-
Fix Version/s: 1.1.2

> wrong application configuration cause cluster down in standalone mode
> -
>
> Key: SPARK-4989
> URL: https://issues.apache.org/jira/browse/SPARK-4989
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Critical
> Fix For: 1.3.0, 1.1.2
>
>
> When event logging is enabled in standalone mode, a wrong configuration can 
> bring the standalone cluster down (the master restarts and loses its connection 
> with the workers).
> How to reproduce: just give an invalid value to "spark.eventLog.dir", for 
> example *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This 
> throws an IllegalArgumentException, which causes the *Master* to restart and 
> leaves the whole cluster unavailable.
> It is not acceptable that the cluster can be crashed by one application's wrong 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4989) wrong application configuration cause cluster down in standalone mode

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4989:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.2  (was: 1.3.0)

> wrong application configuration cause cluster down in standalone mode
> -
>
> Key: SPARK-4989
> URL: https://issues.apache.org/jira/browse/SPARK-4989
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Critical
> Fix For: 1.3.0, 1.1.2
>
>
> When event logging is enabled in standalone mode, a wrong configuration can 
> bring the standalone cluster down (the master restarts and loses its connection 
> with the workers).
> How to reproduce: just give an invalid value to "spark.eventLog.dir", for 
> example *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This 
> throws an IllegalArgumentException, which causes the *Master* to restart and 
> leaves the whole cluster unavailable.
> It is not acceptable that the cluster can be crashed by one application's wrong 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4989) wrong application configuration cause cluster down in standalone mode

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-4989:
--

> wrong application configuration cause cluster down in standalone mode
> -
>
> Key: SPARK-4989
> URL: https://issues.apache.org/jira/browse/SPARK-4989
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Critical
> Fix For: 1.3.0, 1.1.2
>
>
> When event logging is enabled in standalone mode, a wrong configuration can 
> bring the standalone cluster down (the master restarts and loses its connection 
> with the workers).
> How to reproduce: just give an invalid value to "spark.eventLog.dir", for 
> example *spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2*. This 
> throws an IllegalArgumentException, which causes the *Master* to restart and 
> leaves the whole cluster unavailable.
> It is not acceptable that the cluster can be crashed by one application's wrong 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5417) Remove redundant executor-ID set() call

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5417:
-
Target Version/s: 1.3.0, 1.2.1
   Fix Version/s: 1.3.0
Assignee: Ryan Williams
  Labels: backport-needed  (was: )

> Remove redundant executor-ID set() call
> ---
>
> Key: SPARK-5417
> URL: https://issues.apache.org/jira/browse/SPARK-5417
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Assignee: Ryan Williams
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> {{spark.executor.id}} no longer [needs to be set in 
> Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79],
>  as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream 
> in 
> [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332].
>  Might as well remove the redundant set() in Executor.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5434) Preserve spaces in path to spark-ec2

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5434:
-
Target Version/s: 1.3.0, 1.2.1
   Fix Version/s: 1.3.0
  Labels: backport-needed  (was: )

> Preserve spaces in path to spark-ec2
> 
>
> Key: SPARK-5434
> URL: https://issues.apache.org/jira/browse/SPARK-5434
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> If the path to {{spark-ec2}} contains spaces, the script won't run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5434) Preserve spaces in path to spark-ec2

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5434:
-
Assignee: Nicholas Chammas

> Preserve spaces in path to spark-ec2
> 
>
> Key: SPARK-5434
> URL: https://issues.apache.org/jira/browse/SPARK-5434
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> If the path to {{spark-ec2}} contains spaces, the script won't run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4955.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Dynamic allocation doesn't work in YARN cluster mode
> 
>
> Key: SPARK-4955
> URL: https://issues.apache.org/jira/browse/SPARK-4955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>Assignee: Lianhui Wang
>Priority: Blocker
> Fix For: 1.3.0
>
>
> With dynamic executor scaling enabled in yarn-cluster mode, after a query 
> finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, 
> the number of executors is not reduced to the configured minimum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5440) Add toLocalIterator to pyspark rdd

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5440:
--
Assignee: Michael Nazario

> Add toLocalIterator to pyspark rdd
> --
>
> Key: SPARK-5440
> URL: https://issues.apache.org/jira/browse/SPARK-5440
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Michael Nazario
>Assignee: Michael Nazario
> Fix For: 1.3.0
>
>
> toLocalIterator is available in Java and Scala. If we add this functionality 
> to Python, we will also be able to use PySpark to iterate over a dataset 
> partition by partition.
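
For reference, a short spark-shell sketch of the existing Scala behaviour that the 
Python API would mirror (illustrative values only):

{code}
// toLocalIterator pulls partitions to the driver one at a time,
// unlike collect(), which materializes the whole RDD at once.
val rdd = sc.parallelize(1 to 1000, numSlices = 10)
val it: Iterator[Int] = rdd.toLocalIterator
it.take(5).foreach(println)  // only the partitions actually consumed are fetched
{code}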



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5440) Add toLocalIterator to pyspark rdd

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5440.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4237
[https://github.com/apache/spark/pull/4237]

> Add toLocalIterator to pyspark rdd
> --
>
> Key: SPARK-5440
> URL: https://issues.apache.org/jira/browse/SPARK-5440
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Michael Nazario
> Fix For: 1.3.0
>
>
> toLocalIterator is available in Java and Scala. If we add this functionality 
> to Python, we will also be able to use PySpark to iterate over a dataset 
> partition by partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1934) "this" reference escape to "selectorThread" during construction in ConnectionManager

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1934.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

> "this" reference escape to "selectorThread" during construction in 
> ConnectionManager
> 
>
> Key: SPARK-1934
> URL: https://issues.apache.org/jira/browse/SPARK-1934
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0
>
>
> `selectorThread` starts in the constructor of 
> `org.apache.spark.network.ConnectionManager`, which may cause 
> `writeRunnableStarted` and `readRunnableStarted` to be uninitialized before 
> they are used.
> Indirectly, `BlockManager.this` also escapes, since it calls `new 
> ConnectionManager(...)` and is used in some threads of 
> `ConnectionManager`. Those threads may observe an uninitialized `BlockManager`.
> In summary, such escapes are dangerous and make the concurrency hard to reason 
> about; they should be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5188) make-distribution.sh should support curl, not only wget to get Tachyon

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5188:
---
Fix Version/s: 1.3.0

> make-distribution.sh should support curl, not only wget to get Tachyon
> --
>
> Key: SPARK-5188
> URL: https://issues.apache.org/jira/browse/SPARK-5188
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.3.0
>
>
> When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon is 
> downloaded with the `wget` command, but some systems don't have `wget` by 
> default (Mac OS X, for example, doesn't).
> Other scripts such as build/mvn and build/sbt support not only `wget` but also 
> `curl`, so `make-distribution.sh` should support `curl` too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5188) make-distribution.sh should support curl, not only wget to get Tachyon

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5188.

Resolution: Fixed
  Assignee: Kousuke Saruta

> make-distribution.sh should support curl, not only wget to get Tachyon
> --
>
> Key: SPARK-5188
> URL: https://issues.apache.org/jira/browse/SPARK-5188
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>
> When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon is 
> downloaded with the `wget` command, but some systems don't have `wget` by 
> default (Mac OS X, for example, doesn't).
> Other scripts such as build/mvn and build/sbt support not only `wget` but also 
> `curl`, so `make-distribution.sh` should support `curl` too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5458) Refer to aggregateByKey instead of combineByKey in docs

2015-01-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5458.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sandy Ryza

> Refer to aggregateByKey instead of combineByKey in docs
> ---
>
> Key: SPARK-5458
> URL: https://issues.apache.org/jira/browse/SPARK-5458
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Trivial
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5440) Add toLocalIterator to pyspark rdd

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5440:
--
Affects Version/s: (was: 1.2.0)

I'm removing the "Affects Version(s)" field from this since it isn't a bug.

> Add toLocalIterator to pyspark rdd
> --
>
> Key: SPARK-5440
> URL: https://issues.apache.org/jira/browse/SPARK-5440
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Michael Nazario
>
> toLocalIterator is available in Java and Scala. If we add this functionality 
> to Python, we will also be able to use PySpark to iterate over a dataset 
> partition by partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295817#comment-14295817
 ] 

Apache Spark commented on SPARK-5461:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4253

> Graph should have isCheckpointed, getCheckpointFiles methods
> 
>
> Key: SPARK-5461
> URL: https://issues.apache.org/jira/browse/SPARK-5461
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Graph has a checkpoint method but does not have other helper functionality 
> which RDD has.  Proposal:
> {code}
>   /**
>* Return whether this Graph has been checkpointed or not
>*/
>   def isCheckpointed: Boolean
>   /**
>* Gets the name of the files to which this Graph was checkpointed
>*/
>   def getCheckpointFiles: Seq[String]
> {code}
> I need this for [SPARK-1405].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5288) Stabilize Spark SQL data type API followup

2015-01-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295771#comment-14295771
 ] 

Reynold Xin commented on SPARK-5288:


Alright maybe we can expose NumericType as well, but hide the other non-leaf 
ones.

> Stabilize Spark SQL data type API followup 
> ---
>
> Key: SPARK-5288
> URL: https://issues.apache.org/jira/browse/SPARK-5288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Several issues we need to address before the 1.3 release:
> * Do we want to make all classes in 
> org.apache.spark.sql.types.dataTypes.scala public? It seems we do not need to 
> make those abstract classes public.
> * NativeType does not seem to be a very clear or useful concept. Should we just 
> remove it?
> * We need to stabilize the type hierarchy of our data types. It seems StringType 
> and DecimalType should not be primitive types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5288) Stabilize Spark SQL data type API followup

2015-01-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295765#comment-14295765
 ] 

Xiangrui Meng commented on SPARK-5288:
--

+1 on the use case [~prudenko] mentioned. [~rxin] If we only keep the leaf types, 
we should provide methods to validate a group of types, e.g., whether a type 
can be cast to Double.
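
A hypothetical sketch of the kind of helper being discussed (not an existing API; 
the name and the exact set of types are assumptions):

{code}
import org.apache.spark.sql.types._

// Would a value of this type be safely castable to Double?
def isCastableToDouble(dt: DataType): Boolean = dt match {
  case ByteType | ShortType | IntegerType | LongType |
       FloatType | DoubleType | _: DecimalType => true
  case _ => false
}
{code}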

> Stabilize Spark SQL data type API followup 
> ---
>
> Key: SPARK-5288
> URL: https://issues.apache.org/jira/browse/SPARK-5288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Several issues we need to address before the 1.3 release:
> * Do we want to make all classes in 
> org.apache.spark.sql.types.dataTypes.scala public? It seems we do not need to 
> make those abstract classes public.
> * NativeType does not seem to be a very clear or useful concept. Should we just 
> remove it?
> * We need to stabilize the type hierarchy of our data types. It seems StringType 
> and DecimalType should not be primitive types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3996) Shade Jetty in Spark deliverables

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295716#comment-14295716
 ] 

Apache Spark commented on SPARK-3996:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/4252

> Shade Jetty in Spark deliverables
> -
>
> Key: SPARK-3996
> URL: https://issues.apache.org/jira/browse/SPARK-3996
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Mingyu Kim
>Assignee: Matthew Cheah
>
> We'd like to use Spark in a Jetty 9 server, and it's causing a version 
> conflict. Given that Spark's dependency on Jetty is light, it'd be a good 
> idea to shade this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods

2015-01-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-5461:


Assignee: Joseph K. Bradley

> Graph should have isCheckpointed, getCheckpointFiles methods
> 
>
> Key: SPARK-5461
> URL: https://issues.apache.org/jira/browse/SPARK-5461
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Graph has a checkpoint method but does not have other helper functionality 
> which RDD has.  Proposal:
> {code}
>   /**
>* Return whether this Graph has been checkpointed or not
>*/
>   def isCheckpointed: Boolean
>   /**
>* Gets the name of the files to which this Graph was checkpointed
>*/
>   def getCheckpointFiles: Seq[String]
> {code}
> I need this for [SPARK-1405].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods

2015-01-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5461:


 Summary: Graph should have isCheckpointed, getCheckpointFiles 
methods
 Key: SPARK-5461
 URL: https://issues.apache.org/jira/browse/SPARK-5461
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


Graph has a checkpoint method but does not have other helper functionality 
which RDD has.  Proposal:
{code}
  /**
   * Return whether this Graph has been checkpointed or not
   */
  def isCheckpointed: Boolean

  /**
   * Gets the name of the files to which this Graph was checkpointed
   */
  def getCheckpointFiles: Seq[String]
{code}

I need this for [SPARK-1405].
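
A hypothetical usage sketch once the proposed methods exist (spark-shell; the data 
and checkpoint directory are illustrative):

{code}
import org.apache.spark.graphx._

sc.setCheckpointDir("/tmp/checkpoints")
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph.fromEdges(edges, defaultValue = "user")

graph.checkpoint()                 // existing API
graph.edges.count()                // an action triggers the actual checkpoint write
println(graph.isCheckpointed)      // proposed
println(graph.getCheckpointFiles)  // proposed
{code}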



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5460) RandomForest should catch exceptions when removing checkpoint files

2015-01-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5460:


 Summary: RandomForest should catch exceptions when removing 
checkpoint files
 Key: SPARK-5460
 URL: https://issues.apache.org/jira/browse/SPARK-5460
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


RandomForest can optionally use checkpointing.  When it tries to remove 
checkpoint files, it could fail (if a user has write but not delete access on 
some filesystem).  There should be a try-catch to catch exceptions when trying 
to remove checkpoint files in NodeIdCache.
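
A minimal sketch of the guarded deletion described above (the object and method 
names here are illustrative, not necessarily what NodeIdCache uses):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.Logging

object CheckpointCleanup extends Logging {
  // Best-effort removal of an old checkpoint file: log and continue on failure.
  def tryRemove(checkpointFile: String, hadoopConf: Configuration): Unit = {
    try {
      val path = new Path(checkpointFile)
      val fs = path.getFileSystem(hadoopConf)
      fs.delete(path, true)  // recursive delete
    } catch {
      case e: Exception =>
        // The user may have write but not delete permission on the filesystem.
        logWarning(s"Could not delete checkpoint file $checkpointFile: ${e.getMessage}")
    }
  }
}
{code}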



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-01-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295663#comment-14295663
 ] 

Yin Huai commented on SPARK-4768:
-

[~taiji] It seems the attached Parquet file only has the last row ('test row 4', 
null). Can you upload the correct file? Also, can you add a row with a 
nanosecond-precision timestamp value?

> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Priority: Critical
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like Impala when reading Parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5361) Multiple Java RDD <-> Python RDD conversions not working correctly

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5361:
--
Target Version/s: 1.2.2
   Fix Version/s: 1.3.0
  Labels: backport-needed  (was: )

This has been fixed by https://github.com/apache/spark/pull/4146 in 1.3.0.  I'd 
also like to backport this to {{branch-1.2}}, but I'm not doing that right away 
since we're voting on a 1.2.1 RC right now.  I've added the {{backport-needed}} 
label and I'll merge this to {{branch-1.2}} as soon as 1.2.1 is released.

> Multiple Java RDD <-> Python RDD conversions not working correctly
> --
>
> Key: SPARK-5361
> URL: https://issues.apache.org/jira/browse/SPARK-5361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Winston Chen
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> This was found while reading an RDD from `sc.newAPIHadoopRDD` and writing it 
> back using `rdd.saveAsNewAPIHadoopFile` in PySpark.
> It turns out that whenever there are multiple RDD conversions from JavaRDD to 
> PythonRDD and then back to JavaRDD, the exception below happens:
> {noformat}
> 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> java.util.ArrayList
>   at 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
>   at 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> {noformat}
> The test case code below reproduces it:
> {noformat}
> from pyspark.rdd import RDD
> dl = [
> (u'2', {u'director': u'David Lean'}), 
> (u'7', {u'director': u'Andrew Dominik'})
> ]
> dl_rdd = sc.parallelize(dl)
> tmp = dl_rdd._to_java_object_rdd()
> tmp2 = sc._jvm.SerDe.javaToPython(tmp)
> t = RDD(tmp2, sc)
> t.count()
> tmp = t._to_java_object_rdd()
> tmp2 = sc._jvm.SerDe.javaToPython(tmp)
> t = RDD(tmp2, sc)
> t.count() # it blows up here during the 2nd time of conversion
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved

2015-01-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5291.
---
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Kousuke Saruta

Fixed by https://github.com/apache/spark/pull/4082.

> Add timestamp and reason why an executor is removed to 
> SparkListenerExecutorAdded and SparkListenerExecutorRemoved
> --
>
> Key: SPARK-5291
> URL: https://issues.apache.org/jira/browse/SPARK-5291
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.3.0
>
>
> SparkListenerExecutorAdded and SparkListenerExecutorRemoved were added 
> recently.
> I think it would be useful if they carried a timestamp and the reason why an 
> executor was removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-01-28 Thread Alex Baretta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295623#comment-14295623
 ] 

Alex Baretta commented on SPARK-5236:
-

[~lian cheng][~imranr] Thanks for commenting and for taking interest in this 
issue. I definitely wish to help fix this, so that I don't run into this again. 
I'll try to reproduce this with a stock Spark checkout from master.

> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> -
>
> Key: SPARK-5236
> URL: https://issues.apache.org/jira/browse/SPARK-5236
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Baretta
>
> {code}
> 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
> (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
> at 
> org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
> at 
> org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
> at 
> parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
> at 
> parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
> at 
> parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
> ... 27 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


