[jira] [Resolved] (SPARK-6390) Add MatrixUDT in PySpark

2015-06-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-6390.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6354
[https://github.com/apache/spark/pull/6354]

 Add MatrixUDT in PySpark
 

 Key: SPARK-6390
 URL: https://issues.apache.org/jira/browse/SPARK-6390
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
 Fix For: 1.5.0


 After SPARK-6309, we should support MatrixUDT in PySpark too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException

2015-06-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590270#comment-14590270
 ] 

Sean Owen commented on SPARK-8410:
--

Did you {{install}} the artifacts before this? You're only testing a submodule, 
so the other modules need to be installed locally first.
If you did, what about adding the {{hive-thriftserver}} profile to both the 
build and test commands?

 Hive VersionsSuite RuntimeException
 ---

 Key: SPARK-8410
 URL: https://issues.apache.org/jira/browse/SPARK-8410
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
 Environment: IBM Power system - P7
 running Ubuntu 14.04LE
 with IBM JDK version 1.7.0
Reporter: Josiah Samuel Sathiadass
Priority: Minor

 While testing Spark Project Hive, there are RuntimeExceptions as follows,
 VersionsSuite:
 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed: 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: 
 asm#asm;3.2!asm.jar]
   at 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44)
   ...
 The tests are executed with the following set of options:
 build/mvn --pl sql/hive --fail-never -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 test
 Adding the following dependencies to the spark/sql/hive/pom.xml file solves 
 this issue:
  <dependency>
    <groupId>org.jboss.netty</groupId>
    <artifactId>netty</artifactId>
    <version>3.2.2.Final</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.codehaus.groovy</groupId>
    <artifactId>groovy-all</artifactId>
    <version>2.1.6</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>asm</groupId>
    <artifactId>asm</artifactId>
    <version>3.2</version>
    <scope>test</scope>
  </dependency>
 The question is: is this the correct way to fix this RuntimeException?
 If yes, can a pull request fix this issue permanently?
 If not, suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8406:

Target Version/s: 1.4.1, 1.5.0  (was: 1.4.1)

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of the part-files in the destination directory (the <id> in the output 
 file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on the driver side before any files are written. However, in 
 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a 
 wrong max part number, generated from files newly written by other tasks that 
 have already finished within the same job. This causes a race condition. In 
 most cases, it only produces nonconsecutive IDs in output file names. But when 
 the DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 choose the same part number, so one output file gets overwritten by the other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer greater than the default parallelism on 
 your machine (usually the number of cores; on my machine it's 8).
 {noformat}
 -rw-r--r--   3 lian supergroup          0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup        352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing the overwriting:
 {code}
 sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 The expected answer is {{10000}}, but you may see a number like {{9960}} due to 
 overwriting. The actual number varies across runs and nodes.
 Notice that the newly added ORC data source doesn't suffer from this issue 
 because it uses both the part number and {{System.currentTimeMillis()}} to 
 generate the output file name.
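
To make the race concrete, here is a minimal standalone sketch (illustrative names, not the actual Parquet data source code) of why a task-side scan for the next part number can collide:
{code}
import java.io.File

// Each task scans the destination directory and picks "max existing + 1".
// Two tasks that scan before either has written a file compute the same
// number and therefore the same output file name.
def nextPartNumber(dest: File): Int = {
  val partNumbers = Option(dest.list()).getOrElse(Array.empty[String]).flatMap { name =>
    "part-r-(\\d+)".r.findFirstMatchIn(name).map(_.group(1).toInt)
  }
  if (partNumbers.isEmpty) 1 else partNumbers.max + 1
}

// Task A: nextPartNumber(dest) == 9
// Task B (scans before A writes): nextPartNumber(dest) == 9 as well,
// so both write part-r-00009.gz.parquet and one overwrites the other.
{code}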



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590315#comment-14590315
 ] 

Michael Armbrust commented on SPARK-8406:
-

It seems to me that ORC is not free of this bug, but instead just more likely 
to avoid a problem, right?

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of the part-files in the destination directory (the <id> in the output 
 file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on the driver side before any files are written. However, in 
 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a 
 wrong max part number, generated from files newly written by other tasks that 
 have already finished within the same job. This causes a race condition. In 
 most cases, it only produces nonconsecutive IDs in output file names. But when 
 the DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 choose the same part number, so one output file gets overwritten by the other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer greater than the default parallelism on 
 your machine (usually the number of cores; on my machine it's 8).
 {noformat}
 -rw-r--r--   3 lian supergroup          0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup        352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing the overwriting:
 {code}
 sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 The expected answer is {{10000}}, but you may see a number like {{9960}} due to 
 overwriting. The actual number varies across runs and nodes.
 Notice that the newly added ORC data source doesn't suffer from this issue 
 because it uses both the part number and {{System.currentTimeMillis()}} to 
 generate the output file name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4123) Show dependency changes in pull requests

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590389#comment-14590389
 ] 

Josh Rosen commented on SPARK-4123:
---

Hasn't this been re-enabled?  Did we ever end up fixing this?

 Show dependency changes in pull requests
 

 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Brennon York
Priority: Critical

 We should inspect the classpath of Spark's assembly jar for every pull 
 request. This only takes a few seconds in Maven and it will help weed out 
 dependency changes from the master branch. Ideally we'd post any dependency 
 changes in the pull request message.
 {code}
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
 $ git checkout apache/master
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
 $ diff my-classpath master-classpath
 < chill-java-0.3.6.jar
 < chill_2.10-0.3.6.jar
 ---
 > chill-java-0.5.0.jar
 > chill_2.10-0.5.0.jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5787) Protect JVM from some not-important exceptions

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5787:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Protect JVM from some not-important exceptions
 --

 Key: SPARK-5787
 URL: https://issues.apache.org/jira/browse/SPARK-5787
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Davies Liu
Priority: Critical

 Any uncaught exception will shut down the executor JVM, so we should 
 catch those exceptions that don't really hurt the executor (i.e., the 
 executor is still functional).
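
A minimal sketch of the kind of guard this suggests (assumed shape, not the actual executor code): catch non-fatal exceptions so the task fails without taking the JVM down, while still rethrowing truly fatal errors.
{code}
import scala.util.control.NonFatal

def runProtected(taskBody: () => Unit): Unit = {
  try {
    taskBody()
  } catch {
    case NonFatal(e) =>
      // Report the task failure but keep the executor JVM alive.
      System.err.println(s"Task failed, executor still functional: $e")
    case fatal: Throwable =>
      // OutOfMemoryError and friends should still terminate the JVM.
      throw fatal
  }
}
{code}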



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7448:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Implement custom byte array serializer for use in PySpark shuffle
 

 Key: SPARK-7448
 URL: https://issues.apache.org/jira/browse/SPARK-7448
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Shuffle
Reporter: Josh Rosen
Priority: Minor

 PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We 
 should implement a custom Serializer for use in these shuffles.  This will 
 allow us to take advantage of shuffle optimizations like SPARK-7311 for 
 PySpark without requiring users to change the default serializer to 
 KryoSerializer (this is useful for JobServer-type applications).
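
A standalone sketch of the core idea (this is not Spark's {{Serializer}} API, just the framing such a serializer would do): since every record is already an {{Array[Byte]}}, the stream only needs to length-prefix the raw bytes instead of going through generic Java/Kryo serialization.
{code}
import java.io.{DataInputStream, DataOutputStream}

def writeRecord(out: DataOutputStream, record: Array[Byte]): Unit = {
  out.writeInt(record.length)   // length prefix
  out.write(record)             // raw payload, no object graph to encode
}

def readRecord(in: DataInputStream): Array[Byte] = {
  val length = in.readInt()
  val buffer = new Array[Byte](length)
  in.readFully(buffer)
  buffer
}
{code}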



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7078) Cache-aware binary processing in-memory sort

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7078:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Cache-aware binary processing in-memory sort
 

 Key: SPARK-7078
 URL: https://issues.apache.org/jira/browse/SPARK-7078
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: Reynold Xin
Assignee: Josh Rosen

 A cache-friendly sort algorithm that can be used eventually for:
 * sort-merge join
 * shuffle
 See the old AlphaSort paper: 
 http://research.microsoft.com/pubs/68249/alphasort.doc
 Note that the state of the art for sorting has improved quite a bit since 
 then, but we can easily optimize the sorting algorithm itself later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7041) Avoid writing empty files in BypassMergeSortShuffleWriter

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7041:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Avoid writing empty files in BypassMergeSortShuffleWriter
 -

 Key: SPARK-7041
 URL: https://issues.apache.org/jira/browse/SPARK-7041
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Josh Rosen
Assignee: Josh Rosen

 In BypassMergeSortShuffleWriter, we may end up opening disk writer files for 
 empty partitions; this occurs because we manually call {{open()}} after 
 creating the writer, causing the serialization and compression streams to 
 be created; these streams may write headers to the output stream, resulting 
 in non-zero-length files being created for partitions that contain no 
 records. This is unnecessary, though, since the disk object writer will 
 automatically open itself when the first write is performed.  Removing this 
 eager {{open()}} call and rewriting the consumers to cope with the 
 non-existence of empty files results in a large performance benefit for 
 certain sparse workloads when using sort-based shuffle.
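
A minimal sketch of the lazy-open idea (illustrative only, not the real {{DiskBlockObjectWriter}}): the underlying file and its streams are created on the first write, so a partition with no records never produces a file.
{code}
import java.io.{BufferedOutputStream, File, FileOutputStream, OutputStream}

class LazyPartitionWriter(file: File) {
  private var out: OutputStream = null

  def write(bytes: Array[Byte]): Unit = {
    if (out == null) {
      // Opened only when the first record arrives; empty partitions never
      // reach this point, so no zero-record file (or header) is created.
      out = new BufferedOutputStream(new FileOutputStream(file))
    }
    out.write(bytes)
  }

  def close(): Unit = if (out != null) out.close()
}
{code}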



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8368:

Target Version/s: 1.4.1, 1.5.0

 ClassNotFoundException in closure for map 
 --

 Key: SPARK-8368
 URL: https://issues.apache.org/jira/browse/SPARK-8368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
 project on Windows 7 and run in a spark standalone cluster(or local) mode on 
 Centos 6.X. 
Reporter: CHEN Zhiwei

 After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the 
 following exception:
 ==begin exception
 {quote}
 Exception in thread "main" java.lang.ClassNotFoundException: 
 com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:278)
   at 
 org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
   at com.yhd.ycache.magic.Model.main(SSExample.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {quote}
 ===end exception===
 I simplified the code that causes this issue, as follows:
 ==begin code==
 {noformat}
 object Model extends Serializable {
   def main(args: Array[String]) {
     val Array(sql) = args
     val sparkConf = new SparkConf().setAppName("Mode Example")
     val sc = new SparkContext(sparkConf)
     val hive = new HiveContext(sc)
     // get data by hive sql
     val rows = hive.sql(sql)
     val data = rows.map(r => {
       val arr = r.toSeq.toArray
       val label = 1.0
       def fmap = (input: Any) => 1.0
       val feature = arr.map(_ => 1.0)
       LabeledPoint(label, Vectors.dense(feature))
     })
     data.count()
   }
 }
 {noformat}
 =end code===
 This code runs fine in spark-shell, but fails when submitted to a Spark 
 cluster (standalone or local mode). I tried the same code on Spark 1.3.0 
 (local mode), and no exception was encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590491#comment-14590491
 ] 

Benjamin Fradet commented on SPARK-8356:


Somewhat related, on the consistency front: there are also {{PythonUDF}} and 
{{ScalaUdf}}. Maybe we should straighten that out as well.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7017) Refactor dev/run-tests into Python

2015-06-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-7017.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5694
[https://github.com/apache/spark/pull/5694]

 Refactor dev/run-tests into Python
 --

 Key: SPARK-7017
 URL: https://issues.apache.org/jira/browse/SPARK-7017
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Brennon York
Assignee: Brennon York
 Fix For: 1.5.0


 This issue is to specifically track the progress of porting the 
 {{dev/run-tests}} script to Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8406:
--
Description: 
To support appending, the Parquet data source tries to find out the max part 
number of the part-files in the destination directory (the <id> in the output 
file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
this step happens on the driver side before any files are written. However, in 
1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a 
wrong max part number, generated from files newly written by other tasks that 
have already finished within the same job. This causes a race condition. In 
most cases, it only produces nonconsecutive IDs in output file names. But when 
the DataFrame contains thousands of RDD partitions, it's likely that two tasks 
choose the same part number, so one output file gets overwritten by the other.

The data loss situation is not easy to reproduce, but the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on 
your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup          0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup        352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}
Notice that the newly added ORC data source doesn't suffer from this issue 
because it uses both the part number and {{System.currentTimeMillis()}} to 
generate the output file name.

  was:
To support appending, the Parquet data source tries to find out the max ID of 
the part-files in the destination directory (the <id> in the output file name 
part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step 
happens on the driver side before any files are written. However, in 1.4.0, it 
was moved to the task side. Thus, tasks scheduled later may see a wrong max ID, 
generated from files newly written by other tasks that have already finished 
within the same job. This causes a race condition. In most cases, it only 
produces nonconsecutive IDs in output file names. But when the DataFrame 
contains thousands of RDD partitions, it's likely that two tasks choose the 
same ID, so one output file gets overwritten by the other.

The data loss situation is not easy to reproduce, but the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on 
your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup          0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup353 

[jira] [Updated] (SPARK-8391) showDagViz throws OutOfMemoryError, cause the whole jobPage dies

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8391:
-
Affects Version/s: 1.4.0

 showDagViz throws OutOfMemoryError, cause the whole jobPage dies
 

 Key: SPARK-8391
 URL: https://issues.apache.org/jira/browse/SPARK-8391
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula

 When a job is big and has many DAG nodes and edges, showDagViz throws an 
 error and the whole job page fails to render. I think that's unsuitable: a 
 single page element shouldn't bring down the whole page.
 Below is the exception stack trace:
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at 
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at 
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at 
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at 
 scala.collection.mutable.StringBuilder.append(StringBuilder.scala:207)
 at 
 org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
 at 
 org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
 at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
 at 
 org.apache.spark.ui.scope.RDDOperationGraph$.makeDotFile(RDDOperationGraph.scala:171)
 at 
 org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:389)
 at 
 org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:385)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.ui.UIUtils$.showDagViz(UIUtils.scala:385)
 at org.apache.spark.ui.UIUtils$.showDagVizForJob(UIUtils.scala:363)
 at org.apache.spark.ui.jobs.JobPage.render(JobPage.scala:317)
 at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
 at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
 at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:75)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
 at 
 org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
 at 
 org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
 at 
 com.huawei.spark.web.filter.SessionTimeoutFilter.doFilter(SessionTimeoutFilter.java:80)
 at 
 org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6393) Extra RPC to the AM during killExecutor invocation

2015-06-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590426#comment-14590426
 ] 

Patrick Wendell commented on SPARK-6393:


[~sandyryza] I'm un-targeting this. If you are planning on working on this for 
a specific version, feel free to retarget.

 Extra RPC to the AM during killExecutor invocation
 --

 Key: SPARK-6393
 URL: https://issues.apache.org/jira/browse/SPARK-6393
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza

 This was introduced by SPARK-6325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6393) Extra RPC to the AM during killExecutor invocation

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6393:
---
Target Version/s:   (was: 1.4.0)

 Extra RPC to the AM during killExecutor invocation
 --

 Key: SPARK-6393
 URL: https://issues.apache.org/jira/browse/SPARK-6393
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza

 This was introduced by SPARK-6325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6783) Add timing and test output for PR tests

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590439#comment-14590439
 ] 

Josh Rosen commented on SPARK-6783:
---

Don't we already get this via the Jenkins JUnit XML plugin?  Does this JIRA 
cover more than what that plugin provides us?

 Add timing and test output for PR tests
 ---

 Key: SPARK-6783
 URL: https://issues.apache.org/jira/browse/SPARK-6783
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Affects Versions: 1.3.0
Reporter: Brennon York

 Currently the PR tests that run under {{dev/tests/*}} do not provide any 
 output within the actual Jenkins run. It would be nice to not only have error 
 output, but also timing results from each test and have those surfaced within 
 the Jenkins output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-17 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590201#comment-14590201
 ] 

holdenk commented on SPARK-7888:


It seems like scikit-learn takes the easy approach and just asks that the data 
already be centered when the intercept is disabled. Looking at the R code left 
me trying to trace some Fortran that I'm not sure I was understanding 
correctly, but let's sync up when you have some time :)

 Be able to disable intercept in Linear Regression in ML package
 ---

 Key: SPARK-7888
 URL: https://issues.apache.org/jira/browse/SPARK-7888
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8411) No space left on device

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8411.
--
Resolution: Invalid

This isn't nearly enough information to help with the issue; I also think 
you'll find some useful info searching through JIRA.

 No space left on device
 ---

 Key: SPARK-8411
 URL: https://issues.apache.org/jira/browse/SPARK-8411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Mukund Sudarshan

 com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left 
 on device
 This is the error I get when trying to run a program on my cluster. It 
 doesn't occur when I run it locally, however. My cluster is certainly not out 
 of space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590394#comment-14590394
 ] 

Josh Rosen commented on SPARK-3854:
---

Does anyone know whether we're now enforcing this?  [~rxin] may have fixed this 
recently.

 Scala style: require spaces before `{`
 --

 Key: SPARK-3854
 URL: https://issues.apache.org/jira/browse/SPARK-3854
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should require spaces before opening curly braces.  This isn't in the 
 style guide, but it probably should be:
 {code}
 // Correct:
 if (true) {
   println("Wow!")
 }

 // Incorrect:
 if (true){
   println("Wow!")
 }
 {code}
 See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an 
 example in the wild.
 {{git grep "){"}} shows only a few occurrences of this style.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8390) Update DirectKafkaWordCount examples to show how offset ranges can be used

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8390:
---
Issue Type: Improvement  (was: Bug)

 Update DirectKafkaWordCount examples to show how offset ranges can be used
 --

 Key: SPARK-8390
 URL: https://issues.apache.org/jira/browse/SPARK-8390
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8389:
---
Issue Type: New Feature  (was: Bug)

 Expose KafkaRDDs offsetRange in Java and Python
 ---

 Key: SPARK-8389
 URL: https://issues.apache.org/jira/browse/SPARK-8389
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger
Priority: Critical

 Probably requires creating a JavaKafkaPairRDD and also using that in the 
 Python APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7689) Deprecate spark.cleaner.ttl

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7689:
---
Target Version/s: 1.4.1  (was: 1.4.0)

 Deprecate spark.cleaner.ttl
 ---

 Key: SPARK-7689
 URL: https://issues.apache.org/jira/browse/SPARK-7689
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 With the introduction of ContextCleaner, I think there's no longer any reason 
 for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
 perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
 RDDs or broadcast variables in your REPL history and having them never get 
 cleaned up, although I think this is an uncommon use-case).  I think that 
 this property used to be relevant for Spark Streaming jobs, but I think 
 that's no longer the case since the latest Streaming docs have removed all 
 mentions of {{spark.cleaner.ttl}} (see 
 https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
  for example).
 See 
 http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
  for an old, related discussion.  Also, see 
 https://github.com/apache/spark/pull/126, the PR that introduced the new 
 ContextCleaner mechanism.
 We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
 advises users against using it, since it's an unsafe configuration option 
 that can lead to confusing behavior if it's misused.
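
A minimal sketch of what such a deprecation warning could look like (assumed wording and placement, not the actual SparkConf code):
{code}
import org.apache.spark.SparkConf

def warnIfCleanerTtlSet(conf: SparkConf): Unit = {
  if (conf.contains("spark.cleaner.ttl")) {
    System.err.println(
      "WARNING: spark.cleaner.ttl is deprecated and unsafe; it can drop " +
        "still-referenced RDDs and broadcast variables. Rely on the automatic " +
        "ContextCleaner instead.")
  }
}
{code}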



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7521) Allow all required release credentials to be specified with env vars

2015-06-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590451#comment-14590451
 ] 

Patrick Wendell commented on SPARK-7521:


Sort of - I still need to actually contribute my scripts back into the spark 
repo. I will work on getting a PR up for that.

 Allow all required release credentials to be specified with env vars
 

 Key: SPARK-7521
 URL: https://issues.apache.org/jira/browse/SPARK-7521
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 When creating releases the following credentials are needed:
 1. ASF private key, to post artifacts on people.apache.
 2. ASF username and password, to publish to maven and push tags to github.
 3. GPG private key and key password, to sign releases.
 Right now the assumption is that these are made available in the build 
 environment through env vars, pre-installed GPG and private keys, etc. This 
 makes it difficult for us to automate the build, such as allowing the full 
 build+publish to occur on any Jenkins machine.
 One way to fix this is to make sure all of these can be specified as env vars 
 which can then be securely threaded through to the jenkins builder. The 
 script itself would then e.g. create temporary GPG keys and RSA private key 
 for each build, using these env vars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5178) Integrate Python unit tests into Jenkins

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590452#comment-14590452
 ] 

Josh Rosen commented on SPARK-5178:
---

This is a duplicate of SPARK-7021.

 Integrate Python unit tests into Jenkins
 

 Key: SPARK-5178
 URL: https://issues.apache.org/jira/browse/SPARK-5178
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 From [~joshrosen]:
 {quote}
 The Test Result pages for Jenkins builds show some nice statistics for
 the test run, including individual test times:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
 Currently this only covers the Java / Scala tests, but we might be able to
 integrate the PySpark tests here, too (I think it's just a matter of
 getting the Python test runner to generate the correct test result XML
 output).
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5178) Integrate Python unit tests into Jenkins

2015-06-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5178.
---
Resolution: Duplicate

 Integrate Python unit tests into Jenkins
 

 Key: SPARK-5178
 URL: https://issues.apache.org/jira/browse/SPARK-5178
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 From [~joshrosen]:
 {quote}
 The Test Result pages for Jenkins builds show some nice statistics for
 the test run, including individual test times:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
 Currently this only covers the Java / Scala tests, but we might be able to
 integrate the PySpark tests here, too (I think it's just a matter of
 getting the Python test runner to generate the correct test result XML
 output).
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8077) Optimisation of TreeNode for large number of children

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8077.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6673
[https://github.com/apache/spark/pull/6673]

 Optimisation of TreeNode for large number of children
 -

 Key: SPARK-8077
 URL: https://issues.apache.org/jira/browse/SPARK-8077
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Mick Davies
Priority: Minor
 Fix For: 1.5.0


 Large IN clauses are parsed very slowly. For example, the SQL below (10K items 
 in the IN list) takes 45-50s:
 {code}
 s"SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"
 {code}
 This is principally due to TreeNode, which repeatedly calls contains on 
 children, where children in this case is a List that is 10K elements long. In 
 effect, parsing large IN clauses is O(N^2).
 A small change that uses a lazily initialised Set based on children for the 
 contains check reduces parse time to around 2.5s.
 I'd like to create a PR for this change, as we often use IN clauses with a few 
 thousand items.
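
A minimal sketch of the proposed optimisation (illustrative names, not the actual {{TreeNode}} code): build the membership {{Set}} lazily once, so the repeated contains checks during parsing become O(1) instead of scanning a 10K-element List each time.
{code}
class Node(val children: Seq[Node]) {
  // Built once on first use; subsequent containsChild calls are O(1).
  private lazy val childSet: Set[Node] = children.toSet

  def containsChild(candidate: Node): Boolean = childSet.contains(candidate)
}
{code}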



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6557) Only set bold text on PR github test output for problems

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590457#comment-14590457
 ] 

Josh Rosen commented on SPARK-6557:
---

We could also use GitHub Emoji or colored text to make good vs. bad more 
readily distinguishable.

 Only set bold text on PR github test output for problems
 

 Key: SPARK-6557
 URL: https://issues.apache.org/jira/browse/SPARK-6557
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Brennon York
Priority: Trivial
  Labels: starter

 Minor nit, but right now we highlight (i.e. place bold text on) various 
 comments from the PR tests when PRs are submitted. For example, we currently 
 highlight a PR that merges successfully, and also highlight the fact that 
 the PR fails Spark tests.
 I propose that we **only highlight (bold) text when there is a problem with a 
 PR**. For instance, we should not bold that the patch merges cleanly, only 
 bold when it **does not**.
 The entire point is to make it easier for the committers and developers to 
 quickly glance over PR test output and understand what, if anything, they 
 need to dive into.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3244) Add fate sharing across related files in Jenkins

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590462#comment-14590462
 ] 

Josh Rosen commented on SPARK-3244:
---

Now that dev/run-tests and the associated scripts are being ported to Python, 
this may be more easily achievable.

 Add fate sharing across related files in Jenkins
 

 Key: SPARK-3244
 URL: https://issues.apache.org/jira/browse/SPARK-3244
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 1.1.0
Reporter: Andrew Or

 A few files are closely linked with each other. For instance, changes in 
 bin/spark-submit must be reflected in bin/spark-submit.cmd and 
 SparkSubmitDriverBootstrapper.scala. It would be good if Jenkins gave a 
 warning when one file is changed but not the related ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-595) Document local-cluster mode

2015-06-17 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590486#comment-14590486
 ] 

Justin Uang commented on SPARK-595:
---

+1. We are using it for internal testing to ensure that our Kryo serialization 
works.

 Document local-cluster mode
 -

 Key: SPARK-595
 URL: https://issues.apache.org/jira/browse/SPARK-595
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Affects Versions: 0.6.0
Reporter: Josh Rosen
Priority: Minor

 The 'Spark Standalone Mode' guide describes how to manually launch a 
 standalone cluster, which can be done locally for testing, but it does not 
 mention SparkContext's `local-cluster` option.
 What are the differences between these approaches?  Which one should I prefer 
 for local testing?  Can I still use the standalone web interface if I use 
 'local-cluster' mode?
 It would be useful to document this.
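
For reference, a small example of the master string in question (assuming the {{local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]}} format): it starts a miniature standalone cluster inside the test JVM, so records actually get serialized between executors, unlike plain {{local[N]}}.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,512]")  // 2 workers, 1 core and 512 MB each
  .setAppName("local-cluster-smoke-test")
val sc = new SparkContext(conf)

// Forces records through the serializer on their way between executors.
val counts = sc.parallelize(1 to 1000, 4).map(i => (i % 10, 1)).reduceByKey(_ + _)
println(counts.collect().toSeq)
sc.stop()
{code}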



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590483#comment-14590483
 ] 

Michael Armbrust commented on SPARK-8356:
-

Hmm, maybe not.  [~rxin] any idea why we have {{callUDF}} at all?  It seems 
like an uglier version of {{udf}} that doesn't handle input type coercion.
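
For illustration (hedged, based on the 1.4-era API as I understand it, and assuming a Spark shell where {{df}} and {{sqlContext}} are in scope): {{udf}} wraps a typed Scala function directly, while {{callUdf}} looks a registered function up by name.
{code}
import org.apache.spark.sql.functions.{callUdf, udf}

// Typed wrapper: input types are known, so coercion can be applied.
val plusOne = udf((x: Int) => x + 1)
val withUdf = df.select(plusOne(df("id")))

// By-name call: the function must have been registered first.
sqlContext.udf.register("plusOne", (x: Int) => x + 1)
val withCallUdf = df.select(callUdf("plusOne", df("id")))
{code}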

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7546) Example code for ML Pipelines feature transformations

2015-06-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7546:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Example code for ML Pipelines feature transformations
 -

 Key: SPARK-7546
 URL: https://issues.apache.org/jira/browse/SPARK-7546
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha

 This should be added for Scala, Java, and Python.
 It should cover ML Pipelines using a complex series of feature 
 transformations.
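
A minimal sketch of the kind of example being requested (hypothetical column names and toy data, not the final example code, assuming a Spark shell with {{sqlContext}} in scope): chain a couple of feature transformers and an estimator in one {{Pipeline}}.
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training = sqlContext.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
{code}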



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8392) RDDOperationGraph: getting cached nodes is slow

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8392:
-
Assignee: meiyoula

 RDDOperationGraph: getting cached nodes is slow
 ---

 Key: SPARK-8392
 URL: https://issues.apache.org/jira/browse/SPARK-8392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula
Assignee: meiyoula
Priority: Minor

 def getAllNodes: Seq[RDDOperationNode] = {
   _childNodes ++ _childClusters.flatMap(_.childNodes)
 }
 When _childClusters contains many nodes, this becomes very slow and the page 
 appears to hang. I think we can improve the efficiency here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8412) java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8412.
--
Resolution: Not A Problem

The JavaPairRDD doesn't implement it, but the underlying RDD ({{.rdd()}}) 
does.
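
For reference, a hedged sketch of the pattern in Scala (assuming {{directKafkaStream}} was created with {{KafkaUtils.createDirectStream}}; in Java the same cast works on {{rdd.rdd()}}):
{code}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // The wrapper type doesn't expose it, but the underlying RDD does.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
{code}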

 java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement 
 HasOffsetRanges
 --

 Key: SPARK-8412
 URL: https://issues.apache.org/jira/browse/SPARK-8412
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: jweinste
Priority: Critical

 // Create direct kafka stream with brokers and topics
 final JavaPairInputDStream<String, String> messages =
     KafkaUtils.createDirectStream(jssc, String.class, String.class,
         StringDecoder.class, StringDecoder.class, kafkaParams, topics);
 messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
     @Override
     public Void call(final JavaPairRDD<String, String> rdd) throws Exception {
         if (rdd instanceof HasOffsetRanges) {
             // will never happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8371) improve unit test for MaxOf and MinOf

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8371:

Target Version/s: 1.5.0
Shepherd: Davies Liu
Assignee: Wenchen Fan

 improve unit test for MaxOf and MinOf
 -

 Key: SPARK-8371
 URL: https://issues.apache.org/jira/browse/SPARK-8371
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8397) Allow custom configuration for TestHive

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8397:
---
Component/s: SQL

 Allow custom configuration for TestHive
 ---

 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor

 We encourage people to use {{TestHive}} in unit tests, because it's 
 impossible to create more than one {{HiveContext}} within one process. The 
 current implementation locks people into using a {{local[2]}} 
 {{SparkContext}} underlying their {{HiveContext}}. We should make it possible 
 to override this using a system property so that people can test against 
 {{local-cluster}} or remote spark clusters to make their tests more realistic.
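 A minimal sketch of the idea, assuming a hypothetical property name (nothing here is 
 an agreed-upon setting):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 // Fall back to the current hard-coded local[2] master unless a system property
 // overrides it, e.g. -Dspark.sql.test.master=local-cluster[2,1,512]
 val testMaster = sys.props.getOrElse("spark.sql.test.master", "local[2]")

 val sc = new SparkContext(
   new SparkConf()
     .setMaster(testMaster)
     .setAppName("TestSQLContext"))
 {code}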



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be run anywhere

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8388:
---
Target Version/s:   (was: 1.4.0)

 The script docs/_plugins/copy_api_dirs.rb should be run anywhere
 --

 Key: SPARK-8388
 URL: https://issues.apache.org/jira/browse/SPARK-8388
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
Priority: Minor

 The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from 
 anywhere. But right now you have to be in spark/docs and run ruby 
 _plugins/copy_api_dirs.rb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8406:
--
Description: 
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-6.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-7.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-8.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
/user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing overwriting:
{code}
sqlContext.range(0, 
1).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}
Expected answer should be {{1}}, but you may see a number like {{9960}} due 
to overwriting. The actual number varies across runs and nodes.

Notice that the newly added ORC data source doesn't suffer from this issue because 
it uses both the part number and {{System.currentTimeMillis()}} to generate the 
output file name.
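For illustration, here is a hedged sketch of that kind of collision-resistant naming 
(an assumption for clarity, not the actual ORC or Parquet writer code):
{code}
// Combining the task's part number with a timestamp means two tasks can only
// collide if they share a part number and write within the same millisecond.
def outputFileName(partNumber: Int, extension: String = ".gz.parquet"): String =
  f"part-r-$partNumber%05d-${System.currentTimeMillis()}$extension"
{code}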

  was:
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 

[jira] [Created] (SPARK-8414) Ensure ClosureCleaner actually triggers clean ups

2015-06-17 Thread Andrew Or (JIRA)
Andrew Or created SPARK-8414:


 Summary: Ensure ClosureCleaner actually triggers clean ups
 Key: SPARK-8414
 URL: https://issues.apache.org/jira/browse/SPARK-8414
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or


Right now it cleans up old references only through natural GCs, which may not 
occur if the driver has infinite RAM. We should do a periodic GC to make sure 
that we actually do clean things up. Something like once per 30 minutes seems 
relatively inexpensive.
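A rough sketch of what a periodic driver-side GC could look like (an illustration only, 
not the agreed implementation):
{code}
import java.util.concurrent.{Executors, TimeUnit}

// Trigger a GC every 30 minutes so that the weak/phantom references guarding
// cleaned-up state get processed even on a driver whose heap never fills up.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 30, 30, TimeUnit.MINUTES)
{code}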



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-17 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590380#comment-14590380
 ] 

DB Tsai commented on SPARK-7888:


Last night, I figured out how to do this. If you look at the comment from line 
237,
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

The means of the features are removed before training (of course, removing the 
means would densify the vectors, which is not good, so we use an equivalent 
formula), and then we fit a linear regression on the centered data without an 
intercept. When we interpret the model, since it's trained on the centered 
data, the intercept is of course not required. However, in the original 
problem, this centering can be translated into an intercept. As a result, we 
can compute the intercept in closed form in line 183. You may want to draw a 
couple of pictures to help you visualize this.
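In other words, a small numeric sketch of that closed form (made-up numbers, just to 
illustrate): once the weights are fit on mean-centered features, the intercept of the 
original problem is yMean minus the dot product of the weights with the feature means.
{code}
val featureMean = Array(1.5, -0.3)   // hypothetical column means of the features
val yMean = 2.0                      // hypothetical mean of the labels
val weights = Array(0.8, 1.2)        // coefficients fit on the centered data

val intercept = yMean - weights.zip(featureMean).map { case (w, m) => w * m }.sum
// intercept = 2.0 - (0.8 * 1.5 + 1.2 * -0.3) = 2.0 - 0.84 = 1.16
{code}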

Back to the topic of disabling the intercept: you can think of this as training the 
model without the centering step, so the fitted line will cross the origin. 

 Be able to disable intercept in Linear Regression in ML package
 ---

 Key: SPARK-7888
 URL: https://issues.apache.org/jira/browse/SPARK-7888
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6923:

Assignee: Cheng Hao

 Spark SQL CLI does not read Data Source schema correctly
 

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang
Assignee: Cheng Hao
Priority: Critical

 {code:java}
 HiveContext hctx = new HiveContext(sc);
 List<String> sample = new ArrayList<String>();
 sample.add("{\"id\": \"id_1\", \"age\":1}");
 RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
 DataFrame df = hctx.jsonRDD(sampleRDD);
 String table = "test";
 df.saveAsTable(table, "json", SaveMode.Overwrite);
 Table t = hctx.catalog().client().getTable(table);
 System.out.println(t.getCols());
 {code}
 --
 With the code above saving a DataFrame to a Hive table,
 getting the table cols returns one column named 'col':
 [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
 The expected return is the fields of the schema: id, age.
 As a result, the JDBC API cannot retrieve the table columns via ResultSet 
 DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
 tableNamePattern, String columnNamePattern),
 but the ResultSet metadata for the query "select * from test" contains the fields id, 
 age.
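 For what it's worth, a small hedged illustration (reusing the {{hctx}} from the snippet 
 above): the data source schema is still visible through Spark's own catalog even though 
 the Hive metastore columns only show the dummy col.
 {code}
 val df = hctx.table("test")
 df.printSchema()   // expected to show the real fields: id, age
 {code}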



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6923:

Shepherd: Cheng Lian

 Spark SQL CLI does not read Data Source schema correctly
 

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang
Assignee: Cheng Hao
Priority: Critical

 {code:java}
 HiveContext hctx = new HiveContext(sc);
 List<String> sample = new ArrayList<String>();
 sample.add("{\"id\": \"id_1\", \"age\":1}");
 RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
 DataFrame df = hctx.jsonRDD(sampleRDD);
 String table = "test";
 df.saveAsTable(table, "json", SaveMode.Overwrite);
 Table t = hctx.catalog().client().getTable(table);
 System.out.println(t.getCols());
 {code}
 --
 With the code above saving a DataFrame to a Hive table,
 getting the table cols returns one column named 'col':
 [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
 The expected return is the fields of the schema: id, age.
 As a result, the JDBC API cannot retrieve the table columns via ResultSet 
 DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
 tableNamePattern, String columnNamePattern),
 but the ResultSet metadata for the query "select * from test" contains the fields id, 
 age.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8325) Ability to provide role based row level authorization through Spark SQL

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8325:
---
Target Version/s:   (was: 1.4.0)

 Ability to provide role based row level authorization through Spark SQL
 ---

 Key: SPARK-8325
 URL: https://issues.apache.org/jira/browse/SPARK-8325
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.4.0
Reporter: Mayoor Rao
 Attachments: Jira_request_table_authorization.docx


 Using the Datasource API we can register a file as a table through Beeline. 
 With the implementation of jira - SPARK-8324, where we can register queries as 
 views, the authorization requirement is not restricted to Hive tables; it 
 could apply to Spark-registered tables as well. 
 The Thriftserver currently enables us to use the JDBC clients to fetch the 
 data. Data authorization would be required for any enterprise usage.
 Following features are expected – 
 1.Role based authorization
 2.Ability to define roles
 3.Ability to add user to roles
 4.Ability to define authorization at the row level
 Following JDBC commands would be required to manage authorization – 
 ADD ROLE manager WITH DESCRIPTION ProjectManager; -- Create role
 ADD USER james WITH ROLES {roles:[manager,seniorManager]}; -- Create 
 user
 GRANT ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- Grant access to the 
 user on table
 AUTHORIZE ROLE USING {role:manager, tableName:EMPLOYEE, 
 columnName:Employee_id, columnValues: [1]};  -- authorize at the row 
 level
 UPDATE ROLE AUTHORIZATION WITH {role:manager, tableName:EMPLOYEE, 
 columnName:Employee_id, columnValues: [2%,3%]}; -- update 
 authorization 
 REVOKE ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- revoke access 
 DELETE USER james; -- delete user
 DROP ROLE manager; -- delete manager
 Advantage
 • Ability to restrict the data based on the logged in user role.
 • Data protection
 • The organization can control data access to prevent unauthorized usage 
 or viewing of the data
 • The users who are using the BI tools can be restricted to the data they 
 are authorized to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be run anywhere

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8388:
---
Fix Version/s: (was: 1.4.1)

 The script docs/_plugins/copy_api_dirs.rb should be run anywhere
 --

 Key: SPARK-8388
 URL: https://issues.apache.org/jira/browse/SPARK-8388
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
Priority: Minor

 The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from 
 anywhere. But right now you have to be in spark/docs and run ruby 
 _plugins/copy_api_dirs.rb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be run anywhere

2015-06-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590408#comment-14590408
 ] 

Patrick Wendell commented on SPARK-8388:


Hi [~kaixin9ok] - please don't set the fix version on JIRAs that haven't been 
fixed, thanks!

 The script docs/_plugins/copy_api_dirs.rb should be run anywhere
 --

 Key: SPARK-8388
 URL: https://issues.apache.org/jira/browse/SPARK-8388
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
Priority: Minor

 The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from 
 anywhere. But right now you have to be in spark/docs and run ruby 
 _plugins/copy_api_dirs.rb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8325) Ability to provide role based row level authorization through Spark SQL

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8325:
---
Fix Version/s: (was: 1.4.1)

 Ability to provide role based row level authorization through Spark SQL
 ---

 Key: SPARK-8325
 URL: https://issues.apache.org/jira/browse/SPARK-8325
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.4.0
Reporter: Mayoor Rao
 Attachments: Jira_request_table_authorization.docx


 Using the Datasource API we can register a file as a table through Beeline. 
 With the implementation of jira - SPARK-8324, where we can register queries as 
 views, the authorization requirement is not restricted to Hive tables; it 
 could apply to Spark-registered tables as well. 
 The Thriftserver currently enables us to use the JDBC clients to fetch the 
 data. Data authorization would be required for any enterprise usage.
 Following features are expected – 
 1.Role based authorization
 2.Ability to define roles
 3.Ability to add user to roles
 4.Ability to define authorization at the row level
 Following JDBC commands would be required to manage authorization – 
 ADD ROLE manager WITH DESCRIPTION ProjectManager; -- Create role
 ADD USER james WITH ROLES {roles:[manager,seniorManager]}; -- Create 
 user
 GRANT ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- Grant access to the 
 user on table
 AUTHORIZE ROLE USING {role:manager, tableName:EMPLOYEE, 
 columnName:Employee_id, columnValues: [1]};  -- authorize at the row 
 level
 UPDATE ROLE AUTHORIZATION WITH {role:manager, tableName:EMPLOYEE, 
 columnName:Employee_id, columnValues: [2%,3%]}; -- update 
 authorization 
 REVOKE ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- revoke access 
 DELETE USER james; -- delete user
 DROP ROLE manager; -- delete manager
 Advantage
 • Ability to restrict the data based on the logged in user role.
 • Data protection
 • The organization can control data access to prevent unauthorized usage 
 or viewing of the data
 • The users who are using the BI tools can be restricted to the data they 
 are authorized to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8324) Register Query as view through JDBC interface

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8324:
---
Target Version/s:   (was: 1.4.0)

 Register Query as view through JDBC interface
 -

 Key: SPARK-8324
 URL: https://issues.apache.org/jira/browse/SPARK-8324
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.4.0
Reporter: Mayoor Rao
  Labels: features
 Attachments: Jira_request_register_query_as_view.docx


 We currently have the capability of adding csv, json, parquet, etc. files as 
 tables through Beeline using the Datasource API. We need a mechanism to register 
 complex queries as tables through the JDBC interface. The query definition could 
 be composed using table names which are themselves registered as Spark tables 
 using the Datasource API. 
 The query definition should be persisted and should have an option to 
 re-register when the thriftserver is restarted.
 The sql command should be able to either take a filename which contains the 
 json content or it should take the json content directly.
 There should be an option to save the output of the queries and register the 
 output as table.
 Advantage
 • Create adhoc join statements across different data-sources using Spark 
 from external BI interface. So no persistence of pre-aggregated needed.
 • No dependency of creation of programs to generate adhoc analytics
 • Enable business users to model the data across diverse data sources in 
 real time without any programming
 • Enable persistence of the query output through jdbc interface. No extra 
 programming required.
 SQL Syntax for registering a set of queries or files as table - 
 REGISTERSQLJOB USING FILE/JSON FILENAME/JSONContent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8324) Register Query as view through JDBC interface

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8324:
---
Fix Version/s: (was: 1.4.1)

 Register Query as view through JDBC interface
 -

 Key: SPARK-8324
 URL: https://issues.apache.org/jira/browse/SPARK-8324
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.4.0
Reporter: Mayoor Rao
  Labels: features
 Attachments: Jira_request_register_query_as_view.docx


 We currently have the capability of adding csv, json, parquet, etc. files as 
 tables through Beeline using the Datasource API. We need a mechanism to register 
 complex queries as tables through the JDBC interface. The query definition could 
 be composed using table names which are themselves registered as Spark tables 
 using the Datasource API. 
 The query definition should be persisted and should have an option to 
 re-register when the thriftserver is restarted.
 The sql command should be able to either take a filename which contains the 
 json content or it should take the json content directly.
 There should be an option to save the output of the queries and register the 
 output as table.
 Advantage
 • Create adhoc join statements across different data-sources using Spark 
 from external BI interface. So no persistence of pre-aggregated needed.
 • No dependency of creation of programs to generate adhoc analytics
 • Enable business users to model the data across diverse data sources in 
 real time without any programming
 • Enable persistence of the query output through jdbc interface. No extra 
 programming required.
 SQL Syntax for registering a set of queries or files as table - 
 REGISTERSQLJOB USING FILE/JSON FILENAME/JSONContent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7025:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD
 2. Hadoop MapReduce InputFormat
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.
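 A hypothetical Scala sketch of that shape (all names here are assumptions for 
 illustration, not a committed API):
 {code}
 trait InputPartition extends Serializable

 trait RecordReader[T] extends java.io.Closeable {
   def fetchNext(): Boolean   // advance to the next record; false once exhausted
   def get(): T               // the current record
 }

 trait InputSource[T] extends Serializable {
   def getPartitions(): Array[InputPartition]                       // data partitioning
   def createRecordReader(split: InputPartition): RecordReader[T]   // per-partition reads
 }
 {code}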



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7521) Allow all required release credentials to be specified with env vars

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590444#comment-14590444
 ] 

Josh Rosen commented on SPARK-7521:
---

Has this been done now that we're publishing nightly snapshots?

 Allow all required release credentials to be specified with env vars
 

 Key: SPARK-7521
 URL: https://issues.apache.org/jira/browse/SPARK-7521
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 When creating releases the following credentials are needed:
 1. ASF private key, to post artifacts on people.apache.
 2. ASF username and password, to publish to maven and push tags to github.
 3. GPG private key and key password, to sign releases.
 Right now the assumption is that these are made present in the build 
 environment through env vars, installed GPG and private keys, etc. This 
 makes it difficult for us to automate the build, such as allowing the full 
 build+publish to occur on any Jenkins machine.
 One way to fix this is to make sure all of these can be specified as env vars 
 which can then be securely threaded through to the Jenkins builder. The 
 script itself would then, e.g., create temporary GPG keys and an RSA private key 
 for each build, using these env vars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590494#comment-14590494
 ] 

Cheng Lian commented on SPARK-8406:
---

Yeah, just updated the JIRA description.  ORC may hit this issue only when two 
tasks with the same task ID (which means they are in two concurrent jobs) are 
writing to the same location within the same millisecond.

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of part-files in the destination directory (the id in output file 
 name part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on driver side before any files are written. However, in 
 1.4.0, this is moved to task side. Thus, for tasks scheduled later, they may 
 see wrong max part number generated by newly written files by other finished 
 tasks within the same job. This actually causes a race condition. In most 
 cases, this only causes nonconsecutive IDs in output file names. But when the 
 DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 may choose the same part number, thus one of them gets overwritten by the 
 other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 
 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer that is greater than the default 
 parallelism on your machine (usually it means core number, on my machine it's 
 8).
 {noformat}
 -rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-1.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-2.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-3.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-4.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-5.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-6.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-7.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-8.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing overwriting:
 {code}
 sqlContext.range(0, 
 1).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 Expected answer should be {{1}}, but you may see a number like {{9960}} 
 due to overwriting. The actual number varies across runs and nodes.
 Notice that the newly added ORC data source is less likely to hit this issue 
 because it uses the task ID and {{System.currentTimeMillis()}} to generate the 
 output file name. Thus, the ORC data source may hit this issue only when two 
 tasks with the same task ID (which means they are in two concurrent jobs) are 
 writing to the same location within the same millisecond.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8406:
--
Description: 
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-6.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-7.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-8.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
/user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing overwriting:
{code}
sqlContext.range(0, 
1).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}
Expected answer should be {{1}}, but you may see a number like {{9960}} due 
to overwriting. The actual number varies across runs and nodes.

Notice that the newly added ORC data source is less likely to hit this issue 
because it uses the task ID and {{System.currentTimeMillis()}} to generate the 
output file name. Thus, the ORC data source may hit this issue only when two 
tasks with the same task ID (which means they are in two concurrent jobs) are 
writing to the same location within the same millisecond.

  was:
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 

[jira] [Commented] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-17 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590290#comment-14590290
 ] 

Don Drake commented on SPARK-8365:
--

Is there a workaround that you are aware of?

 pyspark does not retain --packages or --jars passed on the command line as of 
 1.4.0
 ---

 Key: SPARK-8365
 URL: https://issues.apache.org/jira/browse/SPARK-8365
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Don Drake
Priority: Blocker

 I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
 Python Spark application against it and got the following error:
 py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
 : java.lang.RuntimeException: Failed to load class for data source: 
 com.databricks.spark.csv
 I pass the following on the command-line to my spark-submit:
 --packages com.databricks:spark-csv_2.10:1.0.3
 This worked fine on 1.3.1, but not in 1.4.
 I was able to replicate it with the following pyspark:
 {code}
 a = {'a':1.0, 'b':'asdf'}
 rdd = sc.parallelize([a])
 df = sqlContext.createDataFrame(rdd)
 df.save("/tmp/d.csv", "com.databricks.spark.csv")
 {code}
 Even using the new 
 df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
 error. 
 I see it was added in the web UI:
 file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
 By User
 file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar   Added 
 By User
 http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
 By User
 http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar   Added 
 By User
 Thoughts?
 *I also attempted using the Scala spark-shell to load a csv using the same 
 package and it worked just fine, so this seems specific to pyspark.*
 -Don
 Gory details:
 {code}
 $ pyspark --packages com.databricks:spark-csv_2.10:1.0.3
 Python 2.7.6 (default, Sep  9 2014, 15:04:36)
 [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
 Type help, copyright, credits or license for more information.
 Ivy Default Cache set to: /Users/drake/.ivy2/cache
 The jars for the packages stored in: /Users/drake/.ivy2/jars
 :: loading settings :: url = 
 jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 com.databricks#spark-csv_2.10 added as a dependency
 :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found com.databricks#spark-csv_2.10;1.0.3 in central
   found org.apache.commons#commons-csv;1.1 in central
 :: resolution report :: resolve 590ms :: artifacts dl 17ms
   :: modules in use:
   com.databricks#spark-csv_2.10;1.0.3 from central in [default]
   org.apache.commons#commons-csv;1.1 from central in [default]
   -
   |  |modules||   artifacts   |
   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
   -
   |  default |   2   |   0   |   0   |   0   ||   2   |   0   |
   -
 :: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   0 artifacts copied, 2 already retrieved (0kB/15ms)
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
 SCDynamicStore
 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local 
 resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on 
 interface en0)
 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(drake); users 
 with modify permissions: Set(drake)
 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
 15/06/13 11:06:10 INFO Remoting: Starting remoting
 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@10.0.0.222:56870]
 15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on 
 port 

[jira] [Updated] (SPARK-6208) executor-memory does not work when using local cluster

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6208:
---
Target Version/s:   (was: 1.4.0)

 executor-memory does not work when using local cluster
 --

 Key: SPARK-6208
 URL: https://issues.apache.org/jira/browse/SPARK-6208
 Project: Spark
  Issue Type: New Feature
  Components: Spark Submit
Reporter: Yin Huai
Priority: Minor

 It seems executor memory set with a local cluster is not correctly set (see 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377).
  Also, totalExecutorCores seems to have the same issue 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7019) Build docs on doc changes

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7019:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Build docs on doc changes
 -

 Key: SPARK-7019
 URL: https://issues.apache.org/jira/browse/SPARK-7019
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Brennon York

 Currently when a pull request changes the {{docs/}} directory, the docs 
 aren't actually built. When a PR is submitted the {{git}} history should be 
 checked to see if any doc changes were made and, if so, properly build the 
 docs and report any issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-06-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7018:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Refactor dev/run-tests-jenkins into Python
 --

 Key: SPARK-7018
 URL: https://issues.apache.org/jira/browse/SPARK-7018
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Brennon York

 This issue is to specifically track the progress of the 
 {{dev/run-tests-jenkins}} script into Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7560) Make flaky tests easier to debug

2015-06-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-7560.
---
Resolution: Fixed

 Make flaky tests easier to debug
 

 Key: SPARK-7560
 URL: https://issues.apache.org/jira/browse/SPARK-7560
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra, Tests
Reporter: Patrick Wendell

 Right now it's really hard for people to even get the logs from a flaky 
 test. Once you get the logs, it's very difficult to figure out which logs are 
 associated with which tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6557) Only set bold text on PR github test output for problems

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590454#comment-14590454
 ] 

Josh Rosen commented on SPARK-6557:
---

I'm going to mark this as being blocked by the refactoring of 
{{dev/run-tests-jenkins}} into Python.

 Only set bold text on PR github test output for problems
 

 Key: SPARK-6557
 URL: https://issues.apache.org/jira/browse/SPARK-6557
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Brennon York
Priority: Trivial
  Labels: starter

 Minor nit, but right now we highlight (i.e. place bold text on) various comments 
 from the PR tests when PRs are submitted. For example: we currently 
 highlight a PR that merges successfully, and also highlight the fact that 
 the PR fails Spark tests.
 I propose that we **only highlight (bold) text when there is a problem with a 
 PR**. For instance, we should not bold that the patch merges cleanly, only 
 bold when it **does not**.
 The entire point is to make it easier for the committers and developers to 
 quickly glance over PR test output and understand what, if anything, they 
 need to dive into.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted

2015-06-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590215#comment-14590215
 ] 

Steve Loughran commented on SPARK-7889:
---

Is this JIRA about

(a) the status on the listing of complete/incomplete being wrong in some way, or
(b) the actual job view (history/some-app-id) being stale when a job completes?

(b) is consistent with what I observed in SPARK-8275

Looking at your patch, and comparing it with my proposal, I prefer mine. All 
I'm proposing is invalidating the cache on work in progress, so that it is 
retrieved again.

Thinking about it some more, we can go one better: rely on the 
{{ApplicationHistoryInfo.lastUpdated}} field to tell us when the UI was last 
updated. If we cache the update time with the UI, then on any GET of an app UI we 
can look to see whether the previous UI was not completed and whether the lastUpdated 
time has changed; if so, that triggers a refresh.

with this approach the entry you see will always be the one most recently 
published to the history store (of any implementation), and picked up by the 
history provider in its getListing()/background refresh operation.
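Roughly, the check could look like this (a sketch of the idea only; the names are 
assumptions):
{code}
case class CachedUI[U](ui: U, completed: Boolean, lastUpdated: Long)

def getOrRefresh[U](
    appId: String,
    cache: scala.collection.mutable.Map[String, CachedUI[U]],
    listingLastUpdated: Long,
    load: String => CachedUI[U]): CachedUI[U] = {
  cache.get(appId) match {
    // Completed apps never change; incomplete ones are fine while still fresh.
    case Some(entry) if entry.completed || entry.lastUpdated >= listingLastUpdated =>
      entry
    // Incomplete and stale: reload from the history provider and re-cache.
    case _ =>
      val refreshed = load(appId)
      cache(appId) = refreshed
      refreshed
  }
}
{code}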

 Jobs progress of apps on complete page of HistoryServer shows uncompleted
 -

 Key: SPARK-7889
 URL: https://issues.apache.org/jira/browse/SPARK-7889
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 When running a SparkPi with 2000 tasks, clicking into the app on the incomplete 
 page, the job progress shows 400/2000. After the app is completed, the app 
 moves from the incomplete page to the complete page, and now clicking into the 
 app, the job progress still shows 400/2000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8406:
--
Description: 
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-6.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-7.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-8.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
/user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing overwriting:
{code}
{code}

Notice that the newly added ORC data source doesn't suffer from this issue because 
it uses both the part number and {{System.currentTimeMillis()}} to generate the 
output file name.

  was:
To support appending, the Parquet data source tries to find out the max part 
number of part-files in the destination directory (the id in output file name 
part-r-id.gz.parquet) at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max part number generated by newly written files by other finished tasks within 
the same job. This actually causes a race condition. In most cases, this only 
causes nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same 
part number, thus one of them gets overwritten by the other.

The data loss situation is not quite easy to reproduce. But the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet

[jira] [Updated] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications

2015-06-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4605:
--
Component/s: (was: Project Infra)

 Proposed Contribution: Spark Kernel to enable interactive Spark applications
 

 Key: SPARK-4605
 URL: https://issues.apache.org/jira/browse/SPARK-4605
 Project: Spark
  Issue Type: New Feature
Reporter: Chip Senkbeil
 Attachments: Kernel Architecture Widescreen.pdf, Kernel 
 Architecture.pdf


 Project available on Github: https://github.com/ibm-et/spark-kernel
 
 This architecture describes the running kernel code that was demonstrated at 
 StrataConf in Barcelona, Spain.
 
 Enables applications to interact with a Spark cluster using Scala in several 
 ways:
 * Defining and running core Spark Tasks
 * Collecting results from a cluster without needing to write to external data 
 store
 ** Ability to stream results using well-defined protocol
 * Arbitrary Scala code definition and execution (without submitting 
 heavy-weight jars)
 Applications can be hosted and managed separate from the Spark cluster using 
 the kernel as a proxy to communicate requests.
 The Spark Kernel implements the server side of the IPython Kernel protocol, 
 the rising “de-facto” protocol for language (Python, Haskell, etc.) execution.
 Inherits a suite of industry adopted clients such as the IPython Notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock

2015-06-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8415:
-

 Summary: Jenkins compilation spends lots of time re-resolving 
dependencies and waiting to acquire Ivy cache lock
 Key: SPARK-8415
 URL: https://issues.apache.org/jira/browse/SPARK-8415
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Reporter: Josh Rosen


When watching a pull request build, I noticed that the compilation + packaging 
+ test compilation phases spent huge amounts of time waiting to acquire the Ivy 
cache lock.  We should see whether we can tell SBT to skip the resolution steps 
for some of these commands, since this could speed up the compilation process 
when Jenkins is heavily loaded.
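
One thing to try (just a sketch, not verified on Jenkins): sbt has a built-in {{offline}} setting that makes Ivy prefer the local cache and skip remote resolution where possible, which might shorten the time spent holding the cache lock for the compile-only invocations.
{code}
// build.sbt sketch -- assumption: sbt's stock `offline` setting is acceptable here.
// With offline := true, Ivy resolves from the local cache instead of re-resolving
// against remote repositories on every invocation.
offline := true

// Or, without touching build files, toggled per invocation:
//   build/sbt "set offline in Global := true" test:compile
{code}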



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5178) Integrate Python unit tests into Jenkins

2015-06-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590442#comment-14590442
 ] 

Josh Rosen commented on SPARK-5178:
---

This is made slightly more complicated by the fact that the PRB tests three 
Python versions, so getting disambiguated test names might be tricky.

 Integrate Python unit tests into Jenkins
 

 Key: SPARK-5178
 URL: https://issues.apache.org/jira/browse/SPARK-5178
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 From [~joshrosen]:
 {quote}
 The Test Result pages for Jenkins builds shows some nice statistics for
 the test run, including individual test times:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
 Currently this only covers the Java / Scala tests, but we might be able to
 integrate the PySpark tests here, too (I think it's just a matter of
 getting the Python test runner to generate the correct test result XML
 output).
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590513#comment-14590513
 ] 

Benjamin Fradet commented on SPARK-8356:


OK, I'll make sure Udf disappears. Should I open another JIRA, or can I add it to 
the PR for this one?

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way, this is 
 confusing, and we should unify them or pick different names.  Also, let's make 
 sure the docs are right.
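 For reference, a rough sketch of how the two read today (the DataFrame {{df}} and its {{price}} column are placeholders, not from this report):
 {code}
 import org.apache.spark.sql.functions.{callUDF, callUdf}
 import org.apache.spark.sql.types.DoubleType

 // callUDF: pass the function itself plus its return type.
 val doubled = df.select(callUDF((x: Double) => x * 2, DoubleType, df("price")))

 // callUdf: refer to an already-registered UDF by name.
 sqlContext.udf.register("double", (x: Double) => x * 2)
 val doubled2 = df.select(callUdf("double", df("price")))
 {code}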



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590505#comment-14590505
 ] 

Michael Armbrust commented on SPARK-8356:
-

Sure (and the convention in Spark would be to use UDF), but those are internal 
APIs, so I'm less concerned there.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way, this is 
 confusing, and we should unify them or pick different names.  Also, let's make 
 sure the docs are right.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7160:

Assignee: Ray Ortigas

 Support converting DataFrames to typed RDDs.
 

 Key: SPARK-7160
 URL: https://issues.apache.org/jira/browse/SPARK-7160
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ray Ortigas
Assignee: Ray Ortigas

 As a Spark user still working with RDDs, I'd like the ability to convert a 
 DataFrame to a typed RDD.
 For example, if I've converted RDDs to DataFrames so that I could save them 
 as Parquet or CSV files, I would like to rebuild the RDD from those files 
 automatically rather than writing the row-to-type conversion myself.
 {code}
 val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
 val df0 = rdd0.toDF()
 df0.save("foods.parquet")
 val df1 = sqlContext.load("foods.parquet")
 val rdd1 = df1.toTypedRDD[Food]()
 // rdd0 and rdd1 should have the same elements
 {code}
 I originally submitted a smaller PR for spark-csv 
 https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested 
 that converting a DataFrame to a typed RDD wasn't something specific to 
 spark-csv.
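 Until such an API exists, the conversion has to be hand-written. A minimal sketch of the boilerplate {{toTypedRDD}} would replace, assuming a {{Food(name: String, count: Int)}} case class and that the saved schema keeps that column order:
 {code}
 case class Food(name: String, count: Int)
 // Hand-rolled row-to-type conversion over the reloaded DataFrame.
 val rdd1 = df1.rdd.map(row => Food(row.getString(0), row.getInt(1)))
 {code}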



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7160:

Priority: Critical  (was: Major)
Target Version/s: 1.5.0
Shepherd: Michael Armbrust

 Support converting DataFrames to typed RDDs.
 

 Key: SPARK-7160
 URL: https://issues.apache.org/jira/browse/SPARK-7160
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ray Ortigas
Assignee: Ray Ortigas
Priority: Critical

 As a Spark user still working with RDDs, I'd like the ability to convert a 
 DataFrame to a typed RDD.
 For example, if I've converted RDDs to DataFrames so that I could save them 
 as Parquet or CSV files, I would like to rebuild the RDD from those files 
 automatically rather than writing the row-to-type conversion myself.
 {code}
 val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
 val df0 = rdd0.toDF()
 df0.save("foods.parquet")
 val df1 = sqlContext.load("foods.parquet")
 val rdd1 = df1.toTypedRDD[Food]()
 // rdd0 and rdd1 should have the same elements
 {code}
 I originally submitted a smaller PR for spark-csv 
 https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested 
 that converting a DataFrame to a typed RDD wasn't something specific to 
 spark-csv.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`

2015-06-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590596#comment-14590596
 ] 

Reynold Xin commented on SPARK-3854:


I don't think we have this yet.


 Scala style: require spaces before `{`
 --

 Key: SPARK-3854
 URL: https://issues.apache.org/jira/browse/SPARK-3854
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should require spaces before opening curly braces.  This isn't in the 
 style guide, but it probably should be:
 {code}
 // Correct:
 if (true) {
    println("Wow!")
 }
 // Incorrect:
 if (true){
   println("Wow!")
 }
 {code}
 See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an 
 example in the wild.
 {{git grep "){"}} shows only a few occurrences of this style.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8144) For PySpark SQL, automatically convert values provided in readwriter options to string

2015-06-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8144:

Summary: For PySpark SQL, automatically convert values provided in 
readwriter options to string  (was: PySpark SQL readwriter options() does not 
work)

 For PySpark SQL, automatically convert values provided in readwriter options 
 to string
 --

 Key: SPARK-8144
 URL: https://issues.apache.org/jira/browse/SPARK-8144
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Because of typos in lines 81 and 240 of:
 [https://github.com/apache/spark/blob/16fc49617e1dfcbe9122b224f7f63b7bfddb36ce/python/pyspark/sql/readwriter.py]
 (Search for {{option(}}.)
 CC: [~yhuai] [~davies]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7626) Actions on DataFrame created from HIVE table with newly added column throw NPE

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7626:

Description: 
We recently added a new column page_context to a hive table named clicks, 
partitioned by data_date.  This leads to NPE being thrown on DataFrame created 
on older partitions without this column populated. For example:

{code}
val hc = new HiveContext(sc)
val clk = hc.sql("select * from clicks where data_date=20150302")
clk.show()
{code}

throws the following error msg:

{code}
java.lang.RuntimeException: cannot find field page_context from 
[0:log_format_number, .]
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
at 
org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
{code}

  was:

We recently added a new column page_context to a hive table named clicks, 
partitioned by data_date.  This leads to NPE being thrown on DataFrame created 
on older partitions without this column populated. For example:

val hc = new HiveContext(sc)
val clk = hc.sql(select * from clicks where data_date=20150302)
clk.show()

throws the following error msg:

java.lang.RuntimeException: cannot find field page_context from 
[0:log_format_number, .]
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
at 
org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at 

[jira] [Updated] (SPARK-7626) Actions on DataFrame created from HIVE table with newly added column throw NPE

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7626:

Component/s: (was: Spark Core)
 SQL

 Actions on DataFrame created from HIVE table with newly added column throw 
 NPE 
 ---

 Key: SPARK-7626
 URL: https://issues.apache.org/jira/browse/SPARK-7626
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Zhiyang Guo

 We recently added a new column page_context to a hive table named clicks, 
 partitioned by data_date.  This leads to NPE being thrown on DataFrame 
 created on older partitions without this column populated. For example:
 val hc = new HiveContext(sc)
 val clk = hc.sql("select * from clicks where data_date=20150302")
 clk.show()
 throws the following error msg:
 java.lang.RuntimeException: cannot find field page_context from 
 [0:log_format_number, .]
 at 
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
 at 
 org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7067) Can't resolve nested column in ORDER BY

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7067.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5659
[https://github.com/apache/spark/pull/5659]

 Can't resolve nested column in ORDER BY
 ---

 Key: SPARK-7067
 URL: https://issues.apache.org/jira/browse/SPARK-7067
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
 Fix For: 1.5.0


 In order to avoid breaking existing HiveQL queries, the current way we 
 resolve column in ORDER BY is: first resolve based on what comes from the 
 select clause and then fall back on its child only when this fails.
 However, this case will fail:
 {code}
  test("orderby queries") {
    jsonRDD(sparkContext.makeRDD(
      """{"a": {"b": [{"c": 1}]}, "b": [{"d": 1}]}""" :: Nil)).registerTempTable("t")
    sql("SELECT a.b FROM t ORDER BY b[0].d").queryExecution.analyzed
  }
 {code}
 Since Hive doesn't support resolving ORDER BY attributes that don't exist in the 
 select clause, this problem is specific to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7026) LeftSemiJoin can not work when it has both equal condition and not equal condition.

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7026:

Target Version/s: 1.5.0

 LeftSemiJoin can not work when it  has both equal condition and not equal 
 condition. 
 -

 Key: SPARK-7026
 URL: https://issues.apache.org/jira/browse/SPARK-7026
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Zhongshuai Pei
Assignee: Adrian Wang

 Run sql like that 
 {panel}
 select *
 from
 web_sales ws1
 left semi join
 web_sales ws2
 on ws1.ws_order_number = ws2.ws_order_number
 and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk 
 {panel}
  then we get an exception
 {panel}
 Couldn't find ws_warehouse_sk#287 in 
 {ws_sold_date_sk#237,ws_sold_time_sk#238,ws_ship_date_sk#239,ws_item_sk#240,ws_bill_customer_sk#241,ws_bill_cdemo_sk#242,ws_bill_hdemo_sk#243,ws_bill_addr_sk#244,ws_ship_customer_sk#245,ws_ship_cdemo_sk#246,ws_ship_hdemo_sk#247,ws_ship_addr_sk#248,ws_web_page_sk#249,ws_web_site_sk#250,ws_ship_mode_sk#251,ws_warehouse_sk#252,ws_promo_sk#253,ws_order_number#254,ws_quantity#255,ws_wholesale_cost#256,ws_list_price#257,ws_sales_price#258,ws_ext_discount_amt#259,ws_ext_sales_price#260,ws_ext_wholesale_cost#261,ws_ext_list_price#262,ws_ext_tax#263,ws_coupon_amt#264,ws_ext_ship_cost#265,ws_net_paid#266,ws_net_paid_inc_tax#267,ws_net_paid_inc_ship#268,ws_net_paid_inc_ship_tax#269,ws_net_profit#270,ws_sold_date#236}
 at scala.sys.package$.error(package.scala:27)
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8390) Update DirectKafkaWordCount examples to show how offset ranges can be used

2015-06-17 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590720#comment-14590720
 ] 

Cody Koeninger commented on SPARK-8390:
---

Did we actually want to update the wordcount examples (which might confuse, since 
you don't need offset ranges for minimal wordcount usage)... or just fix the part 
of the docs about the offset ranges?

The PR is just fixing the docs for now.

I'd personally prefer to link to the talk / slides about the direct stream once 
it's available... not sure how you feel about external links in the doc.
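
For reference, the Scala-side pattern the docs describe looks roughly like this (a sketch; the stream and variable names are placeholders, and the cast must be applied to the direct stream's own RDD before any transformation):
{code}
import org.apache.spark.streaming.kafka.HasOffsetRanges

directKafkaStream.foreachRDD { rdd =>
  // Cast the RDD produced by the direct stream itself to read its offset ranges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
{code}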

 Update DirectKafkaWordCount examples to show how offset ranges can be used
 --

 Key: SPARK-8390
 URL: https://issues.apache.org/jira/browse/SPARK-8390
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Assignee: Cody Koeninger





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8368:

Priority: Blocker  (was: Major)

 ClassNotFoundException in closure for map 
 --

 Key: SPARK-8368
 URL: https://issues.apache.org/jira/browse/SPARK-8368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
 project on Windows 7 and run in a spark standalone cluster(or local) mode on 
 Centos 6.X. 
Reporter: CHEN Zhiwei
Priority: Blocker

 After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the 
 following exception:
 ==begin exception
 {quote}
 Exception in thread "main" java.lang.ClassNotFoundException: 
 com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:278)
   at 
 org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
   at com.yhd.ycache.magic.Model.main(SSExample.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {quote}
 ===end exception===
 I simplify the code that cause this issue, as following:
 ==begin code==
 {noformat}
 object Model extends Serializable {
   def main(args: Array[String]) {
     val Array(sql) = args
     val sparkConf = new SparkConf().setAppName("Mode Example")
     val sc = new SparkContext(sparkConf)
     val hive = new HiveContext(sc)
     // get data by hive sql
     val rows = hive.sql(sql)
     val data = rows.map(r => {
       val arr = r.toSeq.toArray
       val label = 1.0
       def fmap = (input: Any) => 1.0
       val feature = arr.map(_ => 1.0)
       LabeledPoint(label, Vectors.dense(feature))
     })
     data.count()
   }
 }
 {noformat}
 =end code===
 This code runs fine in spark-shell, but fails when submitted to a Spark cluster 
 (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), 
 and no exception was encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590731#comment-14590731
 ] 

Yin Huai commented on SPARK-8368:
-

Right now, it looks like a problem caused by Spark SQL's isolated class 
loader.

 ClassNotFoundException in closure for map 
 --

 Key: SPARK-8368
 URL: https://issues.apache.org/jira/browse/SPARK-8368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
 project on Windows 7 and run in a spark standalone cluster(or local) mode on 
 Centos 6.X. 
Reporter: CHEN Zhiwei
Priority: Blocker

 After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the 
 following exception:
 ==begin exception
 {quote}
 Exception in thread "main" java.lang.ClassNotFoundException: 
 com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:278)
   at 
 org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
   at com.yhd.ycache.magic.Model.main(SSExample.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {quote}
 ===end exception===
 I simplify the code that cause this issue, as following:
 ==begin code==
 {noformat}
 object Model extends Serializable {
   def main(args: Array[String]) {
     val Array(sql) = args
     val sparkConf = new SparkConf().setAppName("Mode Example")
     val sc = new SparkContext(sparkConf)
     val hive = new HiveContext(sc)
     // get data by hive sql
     val rows = hive.sql(sql)
     val data = rows.map(r => {
       val arr = r.toSeq.toArray
       val label = 1.0
       def fmap = (input: Any) => 1.0
       val feature = arr.map(_ => 1.0)
       LabeledPoint(label, Vectors.dense(feature))
     })
     data.count()
   }
 }
 {noformat}
 =end code===
 This code runs fine in spark-shell, but fails when submitted to a Spark cluster 
 (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), 
 and no exception was encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8397) Allow custom configuration for TestHive

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8397.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6844
[https://github.com/apache/spark/pull/6844]

 Allow custom configuration for TestHive
 ---

 Key: SPARK-8397
 URL: https://issues.apache.org/jira/browse/SPARK-8397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Punya Biswal
Priority: Minor
 Fix For: 1.5.0


 We encourage people to use {{TestHive}} in unit tests, because it's 
 impossible to create more than one {{HiveContext}} within one process. The 
 current implementation locks people into using a {{local[2]}} 
 {{SparkContext}} underlying their {{HiveContext}}. We should make it possible 
 to override this using a system property so that people can test against 
 {{local-cluster}} or remote Spark clusters to make their tests more realistic.
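 A minimal sketch of the idea (the property name here is only illustrative, not necessarily what the fix uses):
 {code}
 // Sketch: let a system property override the hard-coded local[2] master.
 val master = System.getProperty("spark.sql.test.master", "local[2]")
 val sparkConf = new SparkConf().setMaster(master).setAppName("TestSQLContext")
 {code}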



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-17 Thread Olivier Girardot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Girardot updated SPARK-8332:

Description: 
I compiled the new Spark 1.4.0 version. 
But when I run a simple WordCount demo, it throws a NoSuchMethodError:
{code}
java.lang.NoSuchMethodError: 
com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
{code}

I found out that the default fasterxml.jackson.version is 2.4.4. 
Is there anything wrong with, or a conflict in, the Jackson version? 
Or does some project Maven dependency possibly pull in the wrong version of Jackson?

  was:
I complied new spark 1.4.0 versio. But when I run a simple WordCount demo, it 
throws NoSuchMethodError java.lang.NoSuchMethodError: 
com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer. 
I found the default fasterxml.jackson.version is 2.4.4. It's there any wrong 
or conflict with the jackson version? Or is there possible some project maven 
dependency contains wrong version jackson?


 NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 --

 Key: SPARK-8332
 URL: https://issues.apache.org/jira/browse/SPARK-8332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: spark 1.4  hadoop 2.3.0-cdh5.0.0
Reporter: Tao Li
Priority: Critical
  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson

 I compiled the new Spark 1.4.0 version. 
 But when I run a simple WordCount demo, it throws a NoSuchMethodError:
 {code}
 java.lang.NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 {code}
 I found out that the default fasterxml.jackson.version is 2.4.4. 
 Is there anything wrong with, or a conflict in, the Jackson version? 
 Or does some project Maven dependency possibly pull in the wrong version of 
 Jackson?
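 One quick way to narrow down a classpath conflict (a sketch only; which module is actually shadowed here is an assumption) is to print which jar the jackson-module-scala classes are loaded from at runtime, e.g. in spark-shell:
 {code}
 // Prints the jar that provides jackson-module-scala on the running classpath,
 // which helps spot an unexpected transitive copy.
 val src = classOf[com.fasterxml.jackson.module.scala.DefaultScalaModule]
   .getProtectionDomain.getCodeSource.getLocation
 println(src)
 {code}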



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2

2015-06-17 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-5971:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Add Mesos support to spark-ec2
 --

 Key: SPARK-5971
 URL: https://issues.apache.org/jira/browse/SPARK-5971
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now, spark-ec2 can only launch Spark clusters that use the standalone 
 manager.
 Adding support for Mesos would be useful mostly for automated performance 
 testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse

2015-06-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590791#comment-14590791
 ] 

Shivaram Venkataraman commented on SPARK-6218:
--

[~nchammas] I just updated the target version to 1.5.0 for this. FWIW I don't 
have a strong opinion about which argument-parsing library we use, as long as we 
can maintain compatibility with Python 2.6.

 Upgrade spark-ec2 from optparse to argparse
 ---

 Key: SPARK-6218
 URL: https://issues.apache.org/jira/browse/SPARK-6218
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 [currently uses 
 optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].
 In Python 2.7, optparse was [deprecated in favor of 
 argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
 motivation for moving away from optparse.
 Additionally, upgrading to argparse provides some [additional benefits noted 
 in the 
 docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html].
  The one we are mostly likely to benefit from is the better input validation.
 Specifically, being able to cleanly tie each input parameter to a validation 
 method will cut down the input validation code currently spread out across 
 the script.
 argparse is not included with Python 2.6, which is currently the minimum 
 version of Python we support in Spark, but it can easily be downloaded by 
 spark-ec2 with the work that has already been done in SPARK-6191.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5971) Add Mesos support to spark-ec2

2015-06-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590799#comment-14590799
 ] 

Shivaram Venkataraman commented on SPARK-5971:
--

Moving this to target 1.5.0

cc [~tnachen] who might be interested in this.

 Add Mesos support to spark-ec2
 --

 Key: SPARK-5971
 URL: https://issues.apache.org/jira/browse/SPARK-5971
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now, spark-ec2 can only launch Spark clusters that use the standalone 
 manager.
 Adding support for Mesos would be useful mostly for automated performance 
 testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7814) Turn code generation on by default

2015-06-17 Thread Herman van Hovell tot Westerflier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590885#comment-14590885
 ] 

Herman van Hovell tot Westerflier commented on SPARK-7814:
--

I have built Spark from the latest source using Hadoop 2.3/2.6 (tried them 
both), using the following command: 
{noformat}
make-distribution.sh -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
-Phive-thriftserver
{noformat}
When I execute the following commands:
{noformat}
val otp = sqlContext.read.parquet("Input/otp.prq")
otp.count
{noformat}
I get the following Janino (Code Generation) error:
{noformat}
15/06/17 19:35:51 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; 
aborting job
15/06/17 19:35:51 ERROR GenerateProjection: failed to compile:
 
import org.apache.spark.sql.catalyst.InternalRow;

public SpecificProjection 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificProjection(expr);
}

class SpecificProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProject {
  private org.apache.spark.sql.catalyst.expressions.Expression[] 
expressions = null;

  public 
SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] expr) 
{
expressions = expr;
  }

  @Override
  public Object apply(Object r) {
return new SpecificRow(expressions, (InternalRow) r);
  }
}

final class SpecificRow extends org.apache.spark.sql.BaseMutableRow {

  private long c0 = -1L;


  public SpecificRow(org.apache.spark.sql.catalyst.expressions.Expression[] 
expressions, InternalRow i) {

{
  // column0
  
  nullBits[0] = false;
  if (!false) {
c0 = 0L;
  }
}

  }

  public int size() { return 1;}
  protected boolean[] nullBits = new boolean[1];
  public void setNullAt(int i) { nullBits[i] = true; }
  public boolean isNullAt(int i) { return nullBits[i]; }

  public Object get(int i) {
if (isNullAt(i)) return null;
switch (i) {
case 0: return c0;
}
return null;
  }
  public void update(int i, Object value) {
if (value == null) {
  setNullAt(i);
  return;
}
nullBits[i] = false;
switch (i) {
case 0: { c0 = (Long)value; return;}
}
  }
  


  @Override
  public long getLong(int i) {
if (isNullAt(i)) {
  return -1L;
}
switch (i) {
case 0: return c0;
}
throw new IllegalArgumentException("Invalid index: " + i
  + " in getLong");
  }




  


  @Override
  public void setLong(int i, long value) {
nullBits[i] = false;
switch (i) {
case 0: { c0 = value; return; }
}
throw new IllegalArgumentException("Invalid index: " + i +
  " in setLong");
  }





  @Override
  public int hashCode() {
int result = 37;

result *= 37; result += isNullAt(0) ? 0 : (c0 ^ (c0 >>> 32));
return result;
  }

  @Override
  public boolean equals(Object other) {
if (other instanceof SpecificRow) {
  SpecificRow row = (SpecificRow) other;
  
if (nullBits[0] != row.nullBits[0] ||
  (!nullBits[0] && !(c0 == row.c0))) {
  return false;
}
  
  return true;
}
return super.equals(other);
  }
}

org.codehaus.commons.compiler.CompileException: Line 16, Column 33: Object
at 
org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:6897)
at 
org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5331)
at 
org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5207)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5188)
at org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5119)
at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
at org.codehaus.janino.UnitCompiler.access$16700(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$31.getParameterTypes2(UnitCompiler.java:8533)
at 
org.codehaus.janino.IClass$IInvocable.getParameterTypes(IClass.java:835)
at org.codehaus.janino.IClass$IMethod.getDescriptor2(IClass.java:1063)
at org.codehaus.janino.IClass$IInvocable.getDescriptor(IClass.java:849)
at org.codehaus.janino.IClass.getIMethods(IClass.java:211)
at org.codehaus.janino.IClass.getIMethods(IClass.java:199)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:409)
at 

[jira] [Created] (SPARK-8417) spark-class has illegal statement

2015-06-17 Thread jweinste (JIRA)
jweinste created SPARK-8417:
---

 Summary: spark-class has illegal statement
 Key: SPARK-8417
 URL: https://issues.apache.org/jira/browse/SPARK-8417
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.0
Reporter: jweinste
Priority: Blocker


spark-class

There is an illegal statement.

done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")

The complaint is:

./bin/spark-class: line 100: syntax error near unexpected token `('





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8412) java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges

2015-06-17 Thread jweinste (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jweinste reopened SPARK-8412:
-

This is improperly implemented or improperly documented. Either way something 
needs to be corrected.

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

You'll have to scan down to

http://spark.apache.org/docs/latest/streaming-kafka-integration.html
#tab_java_2

directKafkaStream.foreachRDD(
  new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd) throws IOException {
      OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges
      // offsetRanges.length = # of Kafka partitions being consumed
      ...
      return null;
    }
  }
);


 java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement 
 HasOffsetRanges
 --

 Key: SPARK-8412
 URL: https://issues.apache.org/jira/browse/SPARK-8412
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: jweinste
Priority: Critical

 // Create direct kafka stream with brokers and topics
 final JavaPairInputDStream<String, String> messages =
     KafkaUtils.createDirectStream(jssc, String.class, String.class,
         StringDecoder.class, StringDecoder.class, kafkaParams, topics);
 messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
     @Override
     public Void call(final JavaPairRDD<String, String> rdd) throws Exception {
         if (rdd instanceof HasOffsetRanges) {
             // will never happen.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7017) Refactor dev/run-tests into Python

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590826#comment-14590826
 ] 

Apache Spark commented on SPARK-7017:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/6865

 Refactor dev/run-tests into Python
 --

 Key: SPARK-7017
 URL: https://issues.apache.org/jira/browse/SPARK-7017
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Brennon York
Assignee: Brennon York
 Fix For: 1.5.0


 This issue is to specifically track the progress of the {{dev/run-tests}} 
 script into Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8348) Add in operator to DataFrame Column

2015-06-17 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590880#comment-14590880
 ] 

Yu Ishikawa commented on SPARK-8348:


Hi [~shivaram], thank you for letting me know about the other PR that adds 
operations to SparkR.

Can I ask a couple of questions about adding a new operator? The added operations 
don't include any method that deals with an array or list, and I am having 
trouble figuring out how to pass an array or list as arguments when calling a 
Java method. The gist includes the details of the code and the error messages; 
please check it.
https://gist.github.com/yu-iskw/ba249f79ef338ff86967

Anyway, {{filter(df, "age in (19)")}} works without problems. But how do I 
implement {{%in%}} in SparkR?

 Add in operator to DataFrame Column
 ---

 Key: SPARK-8348
 URL: https://issues.apache.org/jira/browse/SPARK-8348
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Xiangrui Meng

 It is convenient to add in operator to column, so we can filter values in a 
 set.
 {code}
 df.filter(col("brand").in("dell", "sony"))
 {code}
 In R, the operator should be `%in%`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8406:
---

Assignee: Apache Spark  (was: Cheng Lian)

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of part-files in the destination directory (the <id> in the output file 
 name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on driver side before any files are written. However, in 
 1.4.0, this is moved to task side. Thus, for tasks scheduled later, they may 
 see wrong max part number generated by newly written files by other finished 
 tasks within the same job. This actually causes a race condition. In most 
 cases, this only causes nonconsecutive IDs in output file names. But when the 
 DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 may choose the same part number, thus one of them gets overwritten by the 
 other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer that is greater than the default 
 parallelism on your machine (usually it means core number, on my machine it's 
 8).
 {noformat}
 -rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-1.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-2.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-3.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-4.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-5.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-6.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-7.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-8.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing overwriting:
 {code}
 sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 The expected answer is {{10000}}, but you may see a number like {{9960}} 
 due to overwriting. The actual number varies for different runs and different 
 nodes.
 Notice that the newly added ORC data source is less likely to hit this issue 
 because it uses the task ID and {{System.currentTimeMillis()}} to generate the 
 output file name. Thus, the ORC data source may hit this issue only when two 
 tasks with the same task ID (which means they are in two concurrent jobs) are 
 writing to the same location within the same millisecond.
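 For illustration, a sketch of that ORC-style naming (the helper and exact format below are ours, not the actual writer code):
 {code}
 // Collision-resistant part-file name: task id plus a timestamp, so two tasks
 // can only clash if they share a task id and finish in the same millisecond.
 def partFileName(taskId: Int): String =
   f"part-r-$taskId%05d-${System.currentTimeMillis()}.gz.parquet"
 {code}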



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590770#comment-14590770
 ] 

Apache Spark commented on SPARK-8406:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6864

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of part-files in the destination directory (the <id> in the output file 
 name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on driver side before any files are written. However, in 
 1.4.0, this is moved to task side. Thus, for tasks scheduled later, they may 
 see wrong max part number generated by newly written files by other finished 
 tasks within the same job. This actually causes a race condition. In most 
 cases, this only causes nonconsecutive IDs in output file names. But when the 
 DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 may choose the same part number, thus one of them gets overwritten by the 
 other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer that is greater than the default 
 parallelism on your machine (usually it means core number, on my machine it's 
 8).
 {noformat}
 -rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-1.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-2.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-3.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-4.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-5.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-6.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-7.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-8.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing overwriting:
 {code}
 sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 The expected answer is {{10000}}, but you may see a number like {{9960}} 
 due to overwriting. The actual number varies for different runs and different 
 nodes.
 Notice that the newly added ORC data source is less likely to hit this issue 
 because it uses the task ID and {{System.currentTimeMillis()}} to generate the 
 output file name. Thus, the ORC data source may hit this issue only when two 
 tasks with the same task ID (which means they are in two concurrent jobs) are 
 writing to the same location within the same millisecond.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8406:
---

Assignee: Cheng Lian  (was: Apache Spark)

 Race condition when writing Parquet files
 -

 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 To support appending, the Parquet data source tries to find out the max part 
 number of part-files in the destination directory (the <id> in the output file 
 name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, 
 this step happens on driver side before any files are written. However, in 
 1.4.0, this is moved to task side. Thus, for tasks scheduled later, they may 
 see wrong max part number generated by newly written files by other finished 
 tasks within the same job. This actually causes a race condition. In most 
 cases, this only causes nonconsecutive IDs in output file names. But when the 
 DataFrame contains thousands of RDD partitions, it's likely that two tasks 
 may choose the same part number, thus one of them gets overwritten by the 
 other.
 The following Spark shell snippet can reproduce nonconsecutive part numbers:
 {code}
 sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
 {code}
 16 can be replaced with any integer greater than the default parallelism on 
 your machine (usually this is the number of cores; on my machine it's 8).
 {noformat}
 -rw-r--r--   3 lian supergroup          0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
 -rw-r--r--   3 lian supergroup        352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
 -rw-r--r--   3 lian supergroup        353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
 {noformat}
 And here is another Spark shell snippet for reproducing overwriting:
 {code}
 sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
 sqlContext.read.parquet("foo").count()
 {code}
 The expected answer is {{10000}}, but you may see a number like {{9960}} because 
 some output files are overwritten. The actual number varies across runs and 
 nodes.
 Notice that the newly added ORC data source is less likely to hit this issue 
 because it uses the task ID and {{System.currentTimeMillis()}} to generate the 
 output file name. Thus, the ORC data source may hit this issue only when two 
 tasks with the same task ID (which means they belong to two concurrent jobs) 
 write to the same location within the same millisecond.
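 For illustration only (this is not the ORC data source's actual code; the 
 helper name and exact format are made up), a file name derived from the task 
 ID plus the current time collides only under the conditions described above:
 {code}
 // Hypothetical sketch: name an output part-file after the task ID and the
 // current time, so two files collide only if the same task ID writes to the
 // same directory within the same millisecond.
 def orcPartName(taskId: Int): String =
   f"part-r-$taskId%05d-${System.currentTimeMillis()}.orc"
 {code}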



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse

2015-06-17 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6218:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Upgrade spark-ec2 from optparse to argparse
 ---

 Key: SPARK-6218
 URL: https://issues.apache.org/jira/browse/SPARK-6218
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 [currently uses 
 optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].
 In Python 2.7, optparse was [deprecated in favor of 
 argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
 motivation for moving away from optparse.
 Additionally, upgrading to argparse provides some [additional benefits noted 
 in the 
 docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html].
  The one we are most likely to benefit from is better input validation.
 Specifically, being able to cleanly tie each input parameter to a validation 
 method will cut down on the input-validation code currently spread across 
 the script.
 argparse is not included with Python 2.6, which is currently the minimum 
 version of Python we support in Spark, but it can easily be downloaded by 
 spark-ec2 using the work that has already been done in SPARK-6191.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6813) SparkR style guide

2015-06-17 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590862#comment-14590862
 ] 

Yu Ishikawa commented on SPARK-6813:


Before discussing the details of the coding style, let me confirm one thing: 
should we take the license of the lint software into account or not?

Also, in my opinion, merging the lint software into the master branch is more 
important than making a perfect style guide. So first of all, we should create 
a near-complete style guide and have the official Jenkins run checks against 
it. If we come up with new ideas for the style guide, we can add them later. 

 SparkR style guide
 --

 Key: SPARK-6813
 URL: https://issues.apache.org/jira/browse/SPARK-6813
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 We should develop a SparkR style guide document based on some of the 
 guidelines we use and some of the best practices in R.
 Some examples of R style guides are:
 http://r-pkgs.had.co.nz/r.html#style 
 http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
 A related issue is to work on an automatic style-checking tool. 
 https://github.com/jimhester/lintr seems promising
 We could have an R style guide based on the one from Google [1], and adjust 
 some of its rules based on the discussion in Spark:
 1. Line length: maximum 100 characters
 2. No limit on function name length (the API should be similar to other languages)
 3. Allow S4 objects/methods



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-06-17 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-8418:


 Summary: Add single- and multi-value support to ML Transformers
 Key: SPARK-8418
 URL: https://issues.apache.org/jira/browse/SPARK-8418
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley


It would be convenient if all feature transformers supported transforming 
columns of single values and multiple values, specifically:
* one column with one value (e.g., type {{Double}})
* one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})

We could go as far as supporting multiple columns, but that may not be 
necessary since VectorAssembler could be used to handle that.

Estimators under {{ml.feature}} should also support this.

This will likely require a short design doc to describe:
* how input and output columns will be specified
* schema validation
* code sharing to reduce duplication
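
A hypothetical sketch of the two input shapes such a transformer would have to 
accept (the helper names scaleSingle/scaleMulti and the scaling operation are 
made up for illustration; this is not the proposed API):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// One column holding one value per row (Double).
def scaleSingle(df: DataFrame, inputCol: String, outputCol: String, factor: Double): DataFrame = {
  val scale = udf { x: Double => x * factor }
  df.withColumn(outputCol, scale(df(inputCol)))
}

// One column holding multiple values per row (Vector).
def scaleMulti(df: DataFrame, inputCol: String, outputCol: String, factor: Double): DataFrame = {
  val scale = udf { v: Vector => Vectors.dense(v.toArray.map(_ * factor)) }
  df.withColumn(outputCol, scale(df(inputCol)))
}
{code}
A shared single/multi-value API would mostly be about folding these two code 
paths into one transformer with a common way to specify columns and validate 
the schema.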




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8419) Statistics.colStats could avoid an extra count()

2015-06-17 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-8419:


 Summary: Statistics.colStats could avoid an extra count()
 Key: SPARK-8419
 URL: https://issues.apache.org/jira/browse/SPARK-8419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial


Statistics.colStats goes through RowMatrix to compute the stats.  But 
RowMatrix.computeColumnSummaryStatistics does an extra count() which could be 
avoided.  Not going through RowMatrix would skip this extra pass over the data.
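
A minimal sketch of the kind of single-pass alternative (assuming direct use of 
MultivariateOnlineSummarizer, which RowMatrix also uses internally; the helper 
name is hypothetical):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// A single treeAggregate pass accumulates count, mean, variance, etc. per
// column, so no separate count() over the data is needed.
def colStatsOnePass(rows: RDD[Vector]): MultivariateOnlineSummarizer =
  rows.treeAggregate(new MultivariateOnlineSummarizer)(
    (summary, v) => summary.add(v),
    (s1, s2) => s1.merge(s2))
{code}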



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7814) Turn code generation on by default

2015-06-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7814:
---
Description: Turn code gen on, find a lot of bugs, and see what happens.

 Turn code generation on by default
 --

 Key: SPARK-7814
 URL: https://issues.apache.org/jira/browse/SPARK-7814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu
 Fix For: 1.5.0


 Turn code gen on, find a lot of bugs, and see what happens.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590922#comment-14590922
 ] 

Joseph K. Bradley commented on SPARK-8335:
--

I've discussed this with [~mengxr] and we decided to leave it alone.  I agree 
it's annoying, but we figured that people will use the Pipelines API in the 
future (where this is not an issue) and not breaking people's code would be 
best.  Does that sound tolerable?

 DecisionTreeModel.predict() return type not convenient!
 ---

 Key: SPARK-8335
 URL: https://issues.apache.org/jira/browse/SPARK-8335
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Sebastian Walz
Priority: Minor
  Labels: easyfix, machine_learning
   Original Estimate: 10m
  Remaining Estimate: 10m

 org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
 def predict(features: JavaRDD[Vector]): JavaRDD[Double]
 The problem here is the generic type of the return type, JavaRDD[Double]: it 
 is a Scala Double, whereas I would expect a java.lang.Double (to be 
 consistent, e.g., with 
 org.apache.spark.mllib.classification.ClassificationModel).
 I wanted to extend DecisionTreeModel, use it only for binary classification, 
 and implement the trait 
 org.apache.spark.mllib.classification.ClassificationModel. But that's not 
 possible, because ClassificationModel already defines the predict method, 
 with a return type of JavaRDD[java.lang.Double]. 
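 A minimal sketch of the clash (the class name is made up; the real traits are 
 abbreviated to the two relevant signatures):
 {code}
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.mllib.linalg.Vector

 class BinaryTreeClassifier {
   // DecisionTreeModel-style signature: scala.Double
   def predict(features: JavaRDD[Vector]): JavaRDD[Double] = ???

   // ClassificationModel-style signature: java.lang.Double. Overloading on the
   // return type alone is not allowed, so uncommenting this fails to compile,
   // which is why one class cannot satisfy both interfaces.
   // def predict(features: JavaRDD[Vector]): JavaRDD[java.lang.Double] = ???
 }
 {code}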



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8010) Implicit promote Numeric type to String type in HiveTypeCoercion

2015-06-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8010.
-
   Resolution: Fixed
Fix Version/s: (was: 1.3.1)
   1.5.0

Issue resolved by pull request 6551
[https://github.com/apache/spark/pull/6551]

 Implicit promote Numeric type to String type in HiveTypeCoercion
 ---

 Key: SPARK-8010
 URL: https://issues.apache.org/jira/browse/SPARK-8010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
 Fix For: 1.5.0

   Original Estimate: 48h
  Remaining Estimate: 48h

 1. The query `select coalesce(null, 1, '1') from dual` causes an exception:
   
    java.lang.RuntimeException: Could not determine return type of Coalesce for 
 IntegerType,StringType
 2. The query `select case when true then 1 else '1' end from dual` causes an 
 exception:
    java.lang.RuntimeException: Types in CASE WHEN must be the same or 
 coercible to a common type: StringType != IntegerType
 I checked the code; the main cause is that HiveTypeCoercion doesn't do an 
 implicit conversion when an IntegerType and a StringType appear together.
 Numeric types should be promoted to string type in such cases instead of 
 throwing exceptions. Since Hive always does this, it needs to be fixed.
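 A Spark shell reproduction sketch (assuming {{sqlContext}} is a HiveContext 
 and a table named {{dual}} exists, as in the report); each statement fails 
 with the corresponding exception quoted above:
 {code}
 sqlContext.sql("select coalesce(null, 1, '1') from dual")
 sqlContext.sql("select case when true then 1 else '1' end from dual")
 {code}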



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8372) History server shows incorrect information for application not started

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-8372.

  Resolution: Fixed
   Fix Version/s: 1.5.0
  1.4.1
Assignee: Carson Wang
Target Version/s: 1.4.1, 1.5.0

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Carson Wang
Priority: Minor
 Fix For: 1.4.1, 1.5.0

 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID, such as {{App ID.inprogress}}, 
 for an incomplete application. This app info never disappears, even after the 
 app has completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-8373.

  Resolution: Fixed
   Fix Version/s: 1.5.0
  1.4.1
Assignee: Shixiong Zhu
Target Version/s: 1.4.1, 1.5.0

 When an RDD has no partition, Python sum will throw Can not reduce() empty 
 RDD
 

 Key: SPARK-8373
 URL: https://issues.apache.org/jira/browse/SPARK-8373
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Fix For: 1.4.1, 1.5.0


 The issue is that {{sum}} uses {{reduce}}. Replacing it with {{fold}} will fix 
 it.
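 The analogous contrast on a Scala RDD (the report itself concerns PySpark's 
 {{sum}}, which is built on {{reduce}}):
 {code}
 val empty = sc.emptyRDD[Int]  // an RDD with no partitions
 // empty.reduce(_ + _)        // fails: reduce has no zero value for empty data
 empty.fold(0)(_ + _)          // returns 0: fold starts from the supplied zero value
 {code}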



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8373:
-
Affects Version/s: (was: 1.4.0)
   1.2.0

 When an RDD has no partition, Python sum will throw Can not reduce() empty 
 RDD
 

 Key: SPARK-8373
 URL: https://issues.apache.org/jira/browse/SPARK-8373
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Shixiong Zhu
Priority: Minor
 Fix For: 1.4.1, 1.5.0


 The issue is that {{sum}} uses {{reduce}}. Replacing it with {{fold}} will fix 
 it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8373:
-
Affects Version/s: 1.4.0

 When an RDD has no partition, Python sum will throw Can not reduce() empty 
 RDD
 

 Key: SPARK-8373
 URL: https://issues.apache.org/jira/browse/SPARK-8373
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Shixiong Zhu
Priority: Minor

 The issue is that {{sum}} uses {{reduce}}. Replacing it with {{fold}} will fix 
 it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


