[jira] [Resolved] (SPARK-6390) Add MatrixUDT in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6390. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6354 [https://github.com/apache/spark/pull/6354] Add MatrixUDT in PySpark Key: SPARK-6390 URL: https://issues.apache.org/jira/browse/SPARK-6390 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Fix For: 1.5.0 After SPARK-6309, we should support MatrixUDT in PySpark too.
[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException
[ https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590270#comment-14590270 ] Sean Owen commented on SPARK-8410: -- Did you {{install}} the artifacts before this? Because you're trying to test only a submodule. If so, what about adding the {{hive-thriftserver}} profile to both the build and test commands?

Hive VersionsSuite RuntimeException --- Key: SPARK-8410 URL: https://issues.apache.org/jira/browse/SPARK-8410 Project: Spark Issue Type: Question Components: SQL Affects Versions: 1.3.1, 1.4.0 Environment: IBM Power system - P7 running Ubuntu 14.04LE with IBM JDK version 1.7.0 Reporter: Josiah Samuel Sathiadass Priority: Minor

While testing Spark Project Hive, there are RuntimeExceptions as follows:

{noformat}
VersionsSuite:
- success sanity check *** FAILED ***
  java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: asm#asm;3.2!asm.jar]
  at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
  at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
  at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44)
  ...
{noformat}

The tests are executed with the following set of options:

{noformat}
build/mvn -pl sql/hive --fail-never -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 test
{noformat}

Adding the following dependencies in the spark/sql/hive/pom.xml file solves this issue:

{code}
<dependency>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.2.2.Final</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.codehaus.groovy</groupId>
  <artifactId>groovy-all</artifactId>
  <version>2.1.6</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>asm</groupId>
  <artifactId>asm</artifactId>
  <version>3.2</version>
  <scope>test</scope>
</dependency>
{code}

The question is: is this the correct way to fix this RuntimeException? If yes, can a pull request fix this issue permanently? If not, suggestions please.
[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8406: Target Version/s: 1.4.1, 1.5.0 (was: 1.4.1)

Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max part number, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing the overwriting:

{code}
sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}

The expected answer is {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies across runs and nodes. Notice that the newly added ORC data source doesn't suffer from this issue because it uses both the part number and {{System.currentTimeMillis()}} to generate the output file name.
[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590315#comment-14590315 ] Michael Armbrust commented on SPARK-8406: - It seems to me that ORC is not free of this bug, but instead just more likely to avoid a problem, right?

Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max part number, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing the overwriting:

{code}
sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}

The expected answer is {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies across runs and nodes.
Notice that the newly added ORC data source doesn't suffer from this issue because it uses both the part number and {{System.currentTimeMillis()}} to generate the output file name.
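For contrast, here is a hedged sketch of the ORC-style naming mentioned above (an illustrative helper, not Spark's actual code); as the comment points out, the timestamp makes collisions unlikely rather than impossible:

{code}
// Combine the zero-padded part number with a timestamp so two tasks that race
// on the same part number still (very likely) produce distinct file names.
def outputFileName(partNumber: Int): String =
  f"part-r-$partNumber%05d-${System.currentTimeMillis()}.gz.parquet"
{code}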
[jira] [Commented] (SPARK-4123) Show dependency changes in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590389#comment-14590389 ] Josh Rosen commented on SPARK-4123: --- Hasn't this been re-enabled? Did we ever end up fixing this?

Show dependency changes in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Brennon York Priority: Critical

We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message.

{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : '\n' | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : '\n' | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}
[jira] [Updated] (SPARK-5787) Protect JVM from some unimportant exceptions
[ https://issues.apache.org/jira/browse/SPARK-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5787: --- Target Version/s: 1.5.0 (was: 1.4.0) Protect JVM from some unimportant exceptions -- Key: SPARK-5787 URL: https://issues.apache.org/jira/browse/SPARK-5787 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Critical Any uncaught exception will shut down the executor JVM, so we should catch those exceptions that do not really hurt the executor (i.e., the executor is still functional).
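A minimal sketch of the idea, assuming a hypothetical {{swallowNonFatal}} helper (this is not Spark's actual code): Scala's {{NonFatal}} extractor is a natural fit, because it deliberately lets fatal errors such as {{OutOfMemoryError}} propagate and kill the JVM while harmless exceptions are contained.

{code}
import scala.util.control.NonFatal

// Run a block and contain only non-fatal exceptions; fatal errors still propagate.
def swallowNonFatal[T](body: => T): Option[T] =
  try Some(body) catch {
    case NonFatal(e) =>
      println(s"Ignoring non-fatal exception: $e") // a real version would log this
      None
  }
{code}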
[jira] [Updated] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7448: --- Target Version/s: 1.5.0 (was: 1.4.0) Implement custom byte array serializer for use in PySpark shuffle Key: SPARK-7448 URL: https://issues.apache.org/jira/browse/SPARK-7448 Project: Spark Issue Type: Improvement Components: PySpark, Shuffle Reporter: Josh Rosen Priority: Minor PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We should implement a custom Serializer for use in these shuffles. This will allow us to take advantage of shuffle optimizations like SPARK-7311 for PySpark without requiring users to change the default serializer to KryoSerializer (this is useful for JobServer-type applications).
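Why a dedicated serializer helps: the shuffled values are already {{Array[Byte]}}, so there is nothing to serialize beyond framing the bytes. A hedged sketch of the framing idea (not Spark's actual {{Serializer}} API, whose stream classes are more involved):

{code}
import java.io.{DataInputStream, DataOutputStream}

// Length-prefixed frames: a 4-byte length header followed by the raw payload,
// avoiding any object-graph serialization for data that is already bytes.
def writeFrame(out: DataOutputStream, bytes: Array[Byte]): Unit = {
  out.writeInt(bytes.length)
  out.write(bytes)
}

def readFrame(in: DataInputStream): Array[Byte] = {
  val buf = new Array[Byte](in.readInt())
  in.readFully(buf)
  buf
}
{code}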
[jira] [Updated] (SPARK-7078) Cache-aware binary processing in-memory sort
[ https://issues.apache.org/jira/browse/SPARK-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7078: --- Target Version/s: 1.5.0 (was: 1.4.0) Cache-aware binary processing in-memory sort Key: SPARK-7078 URL: https://issues.apache.org/jira/browse/SPARK-7078 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: Reynold Xin Assignee: Josh Rosen A cache-friendly sort algorithm that can be used eventually for: * sort-merge join * shuffle See the old AlphaSort paper: http://research.microsoft.com/pubs/68249/alphasort.doc Note that the state of the art for sorting has improved quite a bit, but we can easily optimize the sorting algorithm itself later.
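The core AlphaSort idea, in a hedged sketch (names are illustrative, not Spark's): sort compact (key-prefix, pointer) records stored contiguously, so most comparisons read a small flat array instead of chasing pointers to full records, keeping the hot data in CPU cache.

{code}
// Illustrative only: a compact entry holding an 8-byte key prefix and a record pointer.
final case class PrefixEntry(keyPrefix: Long, pointer: Int)

def sortByPrefix(entries: Array[PrefixEntry]): Array[PrefixEntry] =
  entries.sortBy(_.keyPrefix) // ties on equal prefixes would need a full-key comparison
{code}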
[jira] [Updated] (SPARK-7041) Avoid writing empty files in BypassMergeSortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7041: --- Target Version/s: 1.5.0 (was: 1.4.0) Avoid writing empty files in BypassMergeSortShuffleWriter - Key: SPARK-7041 URL: https://issues.apache.org/jira/browse/SPARK-7041 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Josh Rosen Assignee: Josh Rosen In BypassMergeSortShuffleWriter, we may end up opening disk writers for empty partitions; this occurs because we manually call {{open()}} after creating the writer, causing serialization and compression streams to be created; these streams may write headers to the output stream, resulting in non-zero-length files being created for partitions that contain no records. This is unnecessary, though, since the disk object writer will automatically open itself when the first write is performed. Removing this eager {{open()}} call and rewriting the consumers to cope with the non-existence of empty files results in a large performance benefit for certain sparse workloads when using sort-based shuffle.
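The lazy-open pattern described above, as a minimal standalone sketch (a hypothetical {{LazyWriter}}, not the actual DiskBlockObjectWriter):

{code}
import java.io.{FileOutputStream, OutputStream}

// Defer opening until the first write, so empty partitions never create a file
// (and no serialization/compression headers are ever written for them).
class LazyWriter(path: String) {
  private var out: OutputStream = null

  def write(bytes: Array[Byte]): Unit = {
    if (out == null) out = new FileOutputStream(path) // opened on first write only
    out.write(bytes)
  }

  def close(): Unit = if (out != null) out.close()
}
{code}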
[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8368: Target Version/s: 1.4.1, 1.5.0

ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: CentOS 6.5, Java 1.7.0_67, Scala 2.10.4. Build the project on Windows 7 and run in a Spark standalone cluster (or local) mode on CentOS 6.x. Reporter: CHEN Zhiwei

After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the following exception:

==begin exception==
{quote}
Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
at com.yhd.ycache.magic.Model.main(SSExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{quote}
===end exception===

I simplified the code that causes this issue, as follows:

==begin code==
{noformat}
object Model extends Serializable {
  def main(args: Array[String]) {
    val Array(sql) = args
    val sparkConf = new SparkConf().setAppName("Mode Example")
    val sc = new SparkContext(sparkConf)
    val hive = new HiveContext(sc)
    // get data by hive sql
    val rows = hive.sql(sql)
    val data = rows.map(r => {
      val arr = r.toSeq.toArray
      val label = 1.0
      def fmap = (input: Any) => 1.0
      val feature = arr.map(_ => 1.0)
      LabeledPoint(label, Vectors.dense(feature))
    })
    data.count()
  }
}
{noformat}
===end code===

This code runs pretty well in spark-shell, but fails when submitted to a Spark cluster (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), and no exception is encountered.
[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf
[ https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590491#comment-14590491 ] Benjamin Fradet commented on SPARK-8356: Somewhat related, regarding coherence: there are also {{PythonUDF}} and {{ScalaUdf}}. Maybe we should straighten this out as well. Reconcile callUDF and callUdf - Key: SPARK-8356 URL: https://issues.apache.org/jira/browse/SPARK-8356 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Labels: starter Right now we have two functions, {{callUDF}} and {{callUdf}}. I think the former is used for calling Java functions (and the documentation is wrong) and the latter is for calling functions by name. Either way this is confusing and we should unify them or pick different names. Also, let's make sure the docs are right.
[jira] [Resolved] (SPARK-7017) Refactor dev/run-tests into Python
[ https://issues.apache.org/jira/browse/SPARK-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7017. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5694 [https://github.com/apache/spark/pull/5694] Refactor dev/run-tests into Python -- Key: SPARK-7017 URL: https://issues.apache.org/jira/browse/SPARK-7017 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Brennon York Assignee: Brennon York Fix For: 1.5.0 This issue is to specifically track the progress of porting the {{dev/run-tests}} script to Python.
[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8406: -- Description:

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max part number, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other.

The data loss situation is not quite easy to reproduce. But the following Spark shell snippet can reproduce nonconsecutive output file IDs:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}

Notice that the newly added ORC data source doesn't suffer from this issue because it uses both the part number and {{System.currentTimeMillis()}} to generate the output file name.

was: To support appending, the Parquet data source tries to find out the max ID of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max ID, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same ID, and one of them gets overwritten by the other. The data loss situation is not quite easy to reproduce. But the following Spark shell snippet can reproduce nonconsecutive output file IDs:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353
[jira] [Updated] (SPARK-8391) showDagViz throws OutOfMemoryError, causing the whole job page to die
[ https://issues.apache.org/jira/browse/SPARK-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8391: - Affects Version/s: 1.4.0 showDagViz throws OutOfMemoryError, causing the whole job page to die Key: SPARK-8391 URL: https://issues.apache.org/jira/browse/SPARK-8391 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula

When a job is big and has many DAG nodes and edges, showDagViz throws an OutOfMemoryError, and the whole job page fails to render. I think this is unsuitable: a single page element should not take down the whole page. Below is the exception stack trace:

{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:207)
at org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
at org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at org.apache.spark.ui.scope.RDDOperationGraph$.makeDotFile(RDDOperationGraph.scala:171)
at org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:389)
at org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:385)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.ui.UIUtils$.showDagViz(UIUtils.scala:385)
at org.apache.spark.ui.UIUtils$.showDagVizForJob(UIUtils.scala:363)
at org.apache.spark.ui.jobs.JobPage.render(JobPage.scala:317)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:75)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
at com.huawei.spark.web.filter.SessionTimeoutFilter.doFilter(SessionTimeoutFilter.java:80)
at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:75
{noformat}
[jira] [Commented] (SPARK-6393) Extra RPC to the AM during killExecutor invocation
[ https://issues.apache.org/jira/browse/SPARK-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590426#comment-14590426 ] Patrick Wendell commented on SPARK-6393: [~sandyryza] I'm un-targeting this. If you are planning on working on this for a specific version, feel free to retarget. Extra RPC to the AM during killExecutor invocation -- Key: SPARK-6393 URL: https://issues.apache.org/jira/browse/SPARK-6393 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza This was introduced by SPARK-6325
[jira] [Updated] (SPARK-6393) Extra RPC to the AM during killExecutor invocation
[ https://issues.apache.org/jira/browse/SPARK-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6393: --- Target Version/s: (was: 1.4.0) Extra RPC to the AM during killExecutor invocation -- Key: SPARK-6393 URL: https://issues.apache.org/jira/browse/SPARK-6393 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza This was introduced by SPARK-6325
[jira] [Commented] (SPARK-6783) Add timing and test output for PR tests
[ https://issues.apache.org/jira/browse/SPARK-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590439#comment-14590439 ] Josh Rosen commented on SPARK-6783: --- Don't we already get this via the Jenkins JUnit XML plugin? Does this JIRA cover more than what that plugin provides us? Add timing and test output for PR tests --- Key: SPARK-6783 URL: https://issues.apache.org/jira/browse/SPARK-6783 Project: Spark Issue Type: Improvement Components: Build, Project Infra Affects Versions: 1.3.0 Reporter: Brennon York Currently the PR tests that run under {{dev/tests/*}} do not provide any output within the actual Jenkins run. It would be nice to not only have error output, but also timing results from each test and have those surfaced within the Jenkins output.
[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590201#comment-14590201 ] holdenk commented on SPARK-7888: So it seems like scikit-learn takes the easy approach and just asks that the data already be centered when the intercept is disabled. Looking at the R code left me trying to trace some Fortran that I'm not sure I was understanding correctly, but let's sync up when you have some time :) Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk
[jira] [Resolved] (SPARK-8411) No space left on device
[ https://issues.apache.org/jira/browse/SPARK-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8411. -- Resolution: Invalid This isn't nearly enough information to help with the issue; I also think you'll find some useful info searching through JIRA. No space left on device --- Key: SPARK-8411 URL: https://issues.apache.org/jira/browse/SPARK-8411 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Mukund Sudarshan {noformat}com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device{noformat} This is the error I get when trying to run a program on my cluster. It doesn't occur when I run it locally, however. My cluster is certainly not out of space.
[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`
[ https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590394#comment-14590394 ] Josh Rosen commented on SPARK-3854: --- Does anyone know whether we're now enforcing this? [~rxin] may have fixed this recently.

Scala style: require spaces before `{` -- Key: SPARK-3854 URL: https://issues.apache.org/jira/browse/SPARK-3854 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen

We should require spaces before opening curly braces. This isn't in the style guide, but it probably should be:

{code}
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
  println("Wow!")
}
{code}

See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an example in the wild. {{git grep "){"}} shows only a few occurrences of this style.
[jira] [Updated] (SPARK-8390) Update DirectKafkaWordCount examples to show how offset ranges can be used
[ https://issues.apache.org/jira/browse/SPARK-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8390: --- Issue Type: Improvement (was: Bug) Update DirectKafkaWordCount examples to show how offset ranges can be used -- Key: SPARK-8390 URL: https://issues.apache.org/jira/browse/SPARK-8390 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger
[jira] [Updated] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8389: --- Issue Type: New Feature (was: Bug) Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also using that in the Python APIs.
[jira] [Updated] (SPARK-7689) Deprecate spark.cleaner.ttl
[ https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7689: --- Target Version/s: 1.4.1 (was: 1.4.0) Deprecate spark.cleaner.ttl --- Key: SPARK-7689 URL: https://issues.apache.org/jira/browse/SPARK-7689 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen With the introduction of ContextCleaner, I think there's no longer any reason for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except perhaps for super-long-lived Spark REPLs where you're worried about orphaning RDDs or broadcast variables in your REPL history and having them never get cleaned up, although I think this is an uncommon use-case). I think that this property used to be relevant for Spark Streaming jobs, but I think that's no longer the case since the latest Streaming docs have removed all mentions of {{spark.cleaner.ttl}} (see https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817, for example). See http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html for an old, related discussion. Also, see https://github.com/apache/spark/pull/126, the PR that introduced the new ContextCleaner mechanism. We should probably add a deprecation warning to {{spark.cleaner.ttl}} that advises users against using it, since it's an unsafe configuration option that can lead to confusing behavior if it's misused.
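A minimal sketch of the proposed deprecation warning (illustrative, not the actual patch):

{code}
import org.apache.spark.SparkConf

// Warn when the deprecated, unsafe setting is present.
def warnIfCleanerTtlSet(conf: SparkConf): Unit =
  if (conf.contains("spark.cleaner.ttl")) {
    println("WARNING: spark.cleaner.ttl is deprecated and unsafe; it may delete " +
      "metadata for RDDs and broadcast variables that are still in use. " +
      "Prefer the automatic ContextCleaner.") // a real version would use logWarning
  }
{code}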
[jira] [Commented] (SPARK-7521) Allow all required release credentials to be specified with env vars
[ https://issues.apache.org/jira/browse/SPARK-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590451#comment-14590451 ] Patrick Wendell commented on SPARK-7521: Sort of - I still need to actually contribute my scripts back into the Spark repo. I will work on getting a PR up for that. Allow all required release credentials to be specified with env vars Key: SPARK-7521 URL: https://issues.apache.org/jira/browse/SPARK-7521 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell When creating releases, the following credentials are needed: 1. ASF private key, to post artifacts on people.apache. 2. ASF username and password, to publish to Maven and push tags to GitHub. 3. GPG private key and key password, to sign releases. Right now the assumption is that these are made present in the build environment through env vars, installed GPG and private keys, etc. This makes it difficult for us to automate the build, such as allowing the full build+publish to occur on any Jenkins machine. One way to fix this is to make sure all of these can be specified as env vars, which can then be securely threaded through to the Jenkins builder. The script itself would then, e.g., create a temporary GPG key and RSA private key for each build, using these env vars.
[jira] [Commented] (SPARK-5178) Integrate Python unit tests into Jenkins
[ https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590452#comment-14590452 ] Josh Rosen commented on SPARK-5178: --- This is a duplicate of SPARK-7021. Integrate Python unit tests into Jenkins Key: SPARK-5178 URL: https://issues.apache.org/jira/browse/SPARK-5178 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor From [~joshrosen]: {quote} The Test Result pages for Jenkins builds show some nice statistics for the test run, including individual test times: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/ Currently this only covers the Java / Scala tests, but we might be able to integrate the PySpark tests here, too (I think it's just a matter of getting the Python test runner to generate the correct test result XML output). {quote}
[jira] [Resolved] (SPARK-5178) Integrate Python unit tests into Jenkins
[ https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5178. --- Resolution: Duplicate Integrate Python unit tests into Jenkins Key: SPARK-5178 URL: https://issues.apache.org/jira/browse/SPARK-5178 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor From [~joshrosen]: {quote} The Test Result pages for Jenkins builds show some nice statistics for the test run, including individual test times: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/ Currently this only covers the Java / Scala tests, but we might be able to integrate the PySpark tests here, too (I think it's just a matter of getting the Python test runner to generate the correct test result XML output). {quote}
[jira] [Resolved] (SPARK-8077) Optimisation of TreeNode for large number of children
[ https://issues.apache.org/jira/browse/SPARK-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8077. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6673 [https://github.com/apache/spark/pull/6673]

Optimisation of TreeNode for large number of children - Key: SPARK-8077 URL: https://issues.apache.org/jira/browse/SPARK-8077 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Mick Davies Priority: Minor Fix For: 1.5.0

Large IN clauses are parsed very slowly. For example, the SQL below (10K items in the IN clause) takes 45-50s:

{code}
s"SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"
{code}

This is principally due to TreeNode, which repeatedly calls contains on children, where children in this case is a List that is 10K long. In effect, parsing large IN clauses is O(N squared). A small change that uses a lazily initialised Set based on children for contains reduces parse time to around 2.5s. I'd like to create a PR for this change, as we often use IN clauses with a few thousand items.
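The described fix, as a hedged standalone sketch (simplified names, not Catalyst's actual TreeNode): {{List.contains}} is O(n), so one call per child makes the whole pass O(n^2); building a Set once makes each membership test O(1).

{code}
// Illustrative tree node with a lazily built child set for O(1) membership checks.
class Node(val children: List[Node]) {
  private lazy val childSet: Set[Node] = children.toSet
  def containsChild(n: Node): Boolean = childSet.contains(n)
}
{code}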
[jira] [Commented] (SPARK-6557) Only set bold text on PR github test output for problems
[ https://issues.apache.org/jira/browse/SPARK-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590457#comment-14590457 ] Josh Rosen commented on SPARK-6557: --- We could also use GitHub Emoji or colored text to make good vs. bad more readily distinguishable. Only set bold text on PR github test output for problems Key: SPARK-6557 URL: https://issues.apache.org/jira/browse/SPARK-6557 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Brennon York Priority: Trivial Labels: starter Minor nit, but right now we highlight (i.e. place bold text) various comments from the PR tests when the PR's are submitted. For example: we currently highlight a PR that merges successfully, and also highlight the fact that the PR fails Spark tests. I propose that we **only highlight (bold) text when there is a problem with a PR**. For instance, we should not bold that the patch merges cleanly, only bold when it **does not**. The entire point is to make it easier for the committers and developers to quickly glance over PR test output and understand what, if anything, they need to dive into.
[jira] [Commented] (SPARK-3244) Add fate sharing across related files in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590462#comment-14590462 ] Josh Rosen commented on SPARK-3244: --- Now that dev/run-tests and the associated scripts are being ported to Python, this may be more easily achievable. Add fate sharing across related files in Jenkins Key: SPARK-3244 URL: https://issues.apache.org/jira/browse/SPARK-3244 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 1.1.0 Reporter: Andrew Or A few files are closely linked with each other. For instance, changes in bin/spark-submit must be reflected in bin/spark-submit.cmd and SparkSubmitDriverBootstrapper.scala. It would be good if Jenkins gives a warning if one file is changed but not the related ones.
[jira] [Commented] (SPARK-595) Document local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590486#comment-14590486 ] Justin Uang commented on SPARK-595: --- +1. We are using it for internal testing to ensure that our Kryo serialization works. Document local-cluster mode - Key: SPARK-595 URL: https://issues.apache.org/jira/browse/SPARK-595 Project: Spark Issue Type: New Feature Components: Documentation Affects Versions: 0.6.0 Reporter: Josh Rosen Priority: Minor The 'Spark Standalone Mode' guide describes how to manually launch a standalone cluster, which can be done locally for testing, but it does not mention SparkContext's `local-cluster` option. What are the differences between these approaches? Which one should I prefer for local testing? Can I still use the standalone web interface if I use 'local-cluster' mode? It would be useful to document this.
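For reference, a minimal example of the master URL in question. The format is {{local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]}}; unlike {{local[N]}}, the executors run in separate JVMs, so closures and shuffle data really pass through serialization, which is why it is useful for the kind of Kryo testing mentioned above.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Two workers, one core each, 512 MB per worker, all on the local machine.
val sc = new SparkContext(
  new SparkConf().setMaster("local-cluster[2,1,512]").setAppName("serialization-test"))
{code}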
[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf
[ https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590483#comment-14590483 ] Michael Armbrust commented on SPARK-8356: - Hmm, maybe not. [~rxin] any idea why we have {{callUDF}} at all? It seems like an uglier version of {{udf}} that doesn't handle input type coercion. Reconcile callUDF and callUdf - Key: SPARK-8356 URL: https://issues.apache.org/jira/browse/SPARK-8356 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Labels: starter Right now we have two functions, {{callUDF}} and {{callUdf}}. I think the former is used for calling Java functions (and the documentation is wrong) and the latter is for calling functions by name. Either way this is confusing and we should unify them or pick different names. Also, let's make sure the docs are right.
[jira] [Updated] (SPARK-7546) Example code for ML Pipelines feature transformations
[ https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7546: - Target Version/s: 1.5.0 (was: 1.4.0) Example code for ML Pipelines feature transformations - Key: SPARK-7546 URL: https://issues.apache.org/jira/browse/SPARK-7546 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Ram Sriharsha This should be added for Scala, Java, and Python. It should cover ML Pipelines using a complex series of feature transformations.
[jira] [Updated] (SPARK-8392) RDDOperationGraph: getting cached nodes is slow
[ https://issues.apache.org/jira/browse/SPARK-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8392: - Assignee: meiyoula RDDOperationGraph: getting cached nodes is slow --- Key: SPARK-8392 URL: https://issues.apache.org/jira/browse/SPARK-8392 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula Assignee: meiyoula Priority: Minor

{code}
def getAllNodes: Seq[RDDOperationNode] = {
  _childNodes ++ _childClusters.flatMap(_.childNodes)
}
{code}

When {{_childClusters}} has many nodes, this hangs. I think we can improve the efficiency here.
[jira] [Resolved] (SPARK-8412) java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges
[ https://issues.apache.org/jira/browse/SPARK-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8412. -- Resolution: Not A Problem The JavaPairRDD doesn't implement it, but the underlying RDD ({{.rdd()}}) does.

java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges -- Key: SPARK-8412 URL: https://issues.apache.org/jira/browse/SPARK-8412 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: jweinste Priority: Critical

{code}
// Create direct kafka stream with brokers and topics
final JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topics);

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
  @Override
  public Void call(final JavaPairRDD<String, String> rdd) throws Exception {
    if (rdd instanceof HasOffsetRanges) {
      // will never happen.
{code}
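The workaround from the resolution, sketched in Scala (where the direct stream's RDDs implement {{HasOffsetRanges}} directly); from Java the equivalent is to apply the cast to {{rdd.rdd()}}. Here {{stream}} stands for the DStream returned by {{KafkaUtils.createDirectStream}}.

{code}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The cast succeeds on the underlying KafkaRDD, not on the Java wrapper.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}
{code}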
[jira] [Updated] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8371: Target Version/s: 1.5.0 Shepherd: Davies Liu Assignee: Wenchen Fan improve unit test for MaxOf and MinOf - Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Updated] (SPARK-8397) Allow custom configuration for TestHive
[ https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8397: --- Component/s: SQL Allow custom configuration for TestHive --- Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Punya Biswal Priority: Minor We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic.
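A hedged sketch of the proposed override; {{spark.sql.test.master}} is an invented property name for this illustration, not an existing setting:

{code}
// Fall back to the current hard-coded default when the property is absent.
val testMaster: String = sys.props.getOrElse("spark.sql.test.master", "local[2]")
val testConf = new org.apache.spark.SparkConf().setMaster(testMaster).setAppName("TestHive")
{code}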
[jira] [Updated] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be runnable from anywhere
[ https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8388: --- Target Version/s: (was: 1.4.0) The script docs/_plugins/copy_api_dirs.rb should be runnable from anywhere -- Key: SPARK-8388 URL: https://issues.apache.org/jira/browse/SPARK-8388 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Priority: Minor The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from anywhere. But currently you have to be in spark/docs and run {{ruby _plugins/copy_api_dirs.rb}}.
[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8406: -- Description:

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max part number, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other.

The following Spark shell snippet can reproduce nonconsecutive part numbers:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}

And here is another Spark shell snippet for reproducing the overwriting:

{code}
sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}

The expected answer is {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies across runs and nodes. Notice that the newly added ORC data source doesn't suffer from this issue because it uses both the part number and {{System.currentTimeMillis()}} to generate the output file name.

was: To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this was moved to the task side. Thus, tasks scheduled later may see a wrong max part number, generated by files newly written by other tasks that finished within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers:

{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}

16 can be replaced with any integer greater than the default parallelism on your machine (usually the core count; on my machine it's 8).

{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06
[jira] [Created] (SPARK-8414) Ensure ClosureCleaner actually triggers clean ups
Andrew Or created SPARK-8414: Summary: Ensure ClosureCleaner actually triggers clean ups Key: SPARK-8414 URL: https://issues.apache.org/jira/browse/SPARK-8414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Right now it cleans up old references only through natural GCs, which may never occur if the driver has effectively infinite RAM. We should do a periodic GC to make sure that we actually do clean things up. Something like once every 30 minutes seems relatively inexpensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
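For reference, a minimal Scala sketch of the kind of periodic GC trigger this ticket describes; the object and method names are hypothetical and this is not the actual Spark patch:
{code}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Hypothetical sketch: a single daemon thread that calls System.gc()
// periodically so that reference queues (and hence ClosureCleaner
// book-keeping) are drained even when the driver never comes under
// memory pressure.
object PeriodicGC {
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "periodic-gc")
      t.setDaemon(true) // must never keep the driver JVM alive
      t
    }
  })

  def start(periodMinutes: Long = 30L): Unit =
    scheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = System.gc()
    }, periodMinutes, periodMinutes, TimeUnit.MINUTES)
}
{code}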
[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590380#comment-14590380 ] DB Tsai commented on SPARK-7888: Last night, I figured out how to do this. If you look at the comment at line 237 of https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala the means of the features are removed before training (of course, removing the means would densify the vectors, which is not good, so we use an equivalent formulation), and then we fit a linear regression on the centered data without an intercept. When we interpret the model, since it was trained on the centered data, the intercept is of course not required. However, in the original problem, this centering can be translated into an intercept. As a result, we can compute the intercept in closed form at line 183. You may want to draw a couple of pictures to help you visualize this. Back to the topic of disabling the intercept: you can think of this as training the model without the centering, so the line will cross the origin. Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
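To make the closed-form step above concrete, here is a small illustrative Scala snippet. It is not the actual LinearRegression code (it ignores the feature/label scaling the real implementation also applies): after fitting weights w on mean-centered data, the intercept of the original problem is b = yMean - w . xMean.
{code}
// Illustrative only: recover the intercept after training on centered data.
// Model on centered data: (y - yMean) = w . (x - xMean)
// => y = w . x + (yMean - w . xMean), so b = yMean - w . xMean.
def interceptFromCentering(
    weights: Array[Double],
    featureMeans: Array[Double],
    labelMean: Double): Double = {
  val wDotMean = weights.zip(featureMeans).map { case (w, m) => w * m }.sum
  labelMean - wDotMean
}
{code}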
[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6923: Assignee: Cheng Hao Spark SQL CLI does not read Data Source schema correctly Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang Assignee: Cheng Hao Priority: Critical
{code:java}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code}
-- With the code above saving a DataFrame to a Hive table, getting the table cols returns one column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)] The expected return is the field schema id, age. As a result, the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern), while the ResultSet metadata for the query select * from test does contain the fields id, age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6923: Shepherd: Cheng Lian Spark SQL CLI does not read Data Source schema correctly Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang Assignee: Cheng Hao Priority: Critical
{code:java}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code}
-- With the code above saving a DataFrame to a Hive table, getting the table cols returns one column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)] The expected return is the field schema id, age. As a result, the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern), while the ResultSet metadata for the query select * from test does contain the fields id, age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8325) Ability to provide role based row level authorization through Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8325: --- Target Version/s: (was: 1.4.0) Ability to provide role based row level authorization through Spark SQL --- Key: SPARK-8325 URL: https://issues.apache.org/jira/browse/SPARK-8325 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.4.0 Reporter: Mayoor Rao Attachments: Jira_request_table_authorization.docx Using the Datasource API we can register a file as a table through Beeline. With the implementation of SPARK-8324, where we can register queries as views, the authorization requirement is not restricted to Hive tables; it could cover Spark-registered tables as well. The Thriftserver currently enables us to use JDBC clients to fetch the data. Data authorization would be required for any enterprise usage. The following features are expected:
1. Role based authorization
2. Ability to define roles
3. Ability to add users to roles
4. Ability to define authorization at the row level
The following JDBC commands would be required to manage authorization:
ADD ROLE manager WITH DESCRIPTION ProjectManager; -- create a role
ADD USER james WITH ROLES {roles:[manager,seniorManager]}; -- create a user
GRANT ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- grant the role access to a table
AUTHORIZE ROLE USING {role:manager, tableName:EMPLOYEE, columnName:Employee_id, columnValues: [1]}; -- authorize at the row level
UPDATE ROLE AUTHORIZATION WITH {role:manager, tableName:EMPLOYEE, columnName:Employee_id, columnValues: [2%,3%]}; -- update authorization
REVOKE ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- revoke access
DELETE USER james; -- delete a user
DROP ROLE manager; -- drop a role
Advantages:
• Ability to restrict the data based on the logged-in user's role.
• Data protection: the organization can control data access to prevent unauthorized usage or viewing of the data.
• Users of BI tools can be restricted to the data they are authorized to see.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be run anywhere
[ https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8388: --- Fix Version/s: (was: 1.4.1) The script docs/_plugins/copy_api_dirs.rb should be run anywhere -- Key: SPARK-8388 URL: https://issues.apache.org/jira/browse/SPARK-8388 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Priority: Minor The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from anywhere. But right now you have to be in spark/docs and run {{ruby _plugins/copy_api_dirs.rb}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8388) The script docs/_plugins/copy_api_dirs.rb should be run anywhere
[ https://issues.apache.org/jira/browse/SPARK-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590408#comment-14590408 ] Patrick Wendell commented on SPARK-8388: Hi [~kaixin9ok] - please don't set the fix version on JIRAs that haven't been fixed, thanks! The script docs/_plugins/copy_api_dirs.rb should be run anywhere -- Key: SPARK-8388 URL: https://issues.apache.org/jira/browse/SPARK-8388 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Priority: Minor The script copy_api_dirs.rb in spark/docs/_plugins should be runnable from anywhere. But right now you have to be in spark/docs and run {{ruby _plugins/copy_api_dirs.rb}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8325) Ability to provide role based row level authorization through Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8325: --- Fix Version/s: (was: 1.4.1) Ability to provide role based row level authorization through Spark SQL --- Key: SPARK-8325 URL: https://issues.apache.org/jira/browse/SPARK-8325 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.4.0 Reporter: Mayoor Rao Attachments: Jira_request_table_authorization.docx Using the Datasource API we can register a file as a table through Beeline. With the implementation of SPARK-8324, where we can register queries as views, the authorization requirement is not restricted to Hive tables; it could cover Spark-registered tables as well. The Thriftserver currently enables us to use JDBC clients to fetch the data. Data authorization would be required for any enterprise usage. The following features are expected:
1. Role based authorization
2. Ability to define roles
3. Ability to add users to roles
4. Ability to define authorization at the row level
The following JDBC commands would be required to manage authorization:
ADD ROLE manager WITH DESCRIPTION ProjectManager; -- create a role
ADD USER james WITH ROLES {roles:[manager,seniorManager]}; -- create a user
GRANT ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- grant the role access to a table
AUTHORIZE ROLE USING {role:manager, tableName:EMPLOYEE, columnName:Employee_id, columnValues: [1]}; -- authorize at the row level
UPDATE ROLE AUTHORIZATION WITH {role:manager, tableName:EMPLOYEE, columnName:Employee_id, columnValues: [2%,3%]}; -- update authorization
REVOKE ACCESS ON EMPLOYEE FOR {roles:[manager]}; -- revoke access
DELETE USER james; -- delete a user
DROP ROLE manager; -- drop a role
Advantages:
• Ability to restrict the data based on the logged-in user's role.
• Data protection: the organization can control data access to prevent unauthorized usage or viewing of the data.
• Users of BI tools can be restricted to the data they are authorized to see.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8324) Register Query as view through JDBC interface
[ https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8324: --- Target Version/s: (was: 1.4.0) Register Query as view through JDBC interface - Key: SPARK-8324 URL: https://issues.apache.org/jira/browse/SPARK-8324 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.4.0 Reporter: Mayoor Rao Labels: features Attachments: Jira_request_register_query_as_view.docx We currently have the capability of adding csv, json, parquet, etc. files as tables through Beeline using the Datasource API. We need a mechanism to register a complex query as a table through the JDBC interface. The query definition could be composed using table names which are themselves registered as Spark tables using the Datasource API. The query definition should be persisted, and there should be an option to re-register it when the Thriftserver is restarted. The SQL command should be able to take either a filename containing the JSON content or the JSON content directly. There should be an option to save the output of the queries and register the output as a table. Advantages:
• Create ad hoc join statements across different data sources using Spark from an external BI interface, with no persistence of pre-aggregated data needed.
• No dependency on writing programs to generate ad hoc analytics.
• Enable business users to model the data across diverse data sources in real time without any programming.
• Enable persistence of the query output through the JDBC interface, with no extra programming required.
SQL syntax for registering a set of queries or files as a table - REGISTERSQLJOB USING FILE/JSON FILENAME/JSONContent -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8324) Register Query as view through JDBC interface
[ https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8324: --- Fix Version/s: (was: 1.4.1) Register Query as view through JDBC interface - Key: SPARK-8324 URL: https://issues.apache.org/jira/browse/SPARK-8324 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.4.0 Reporter: Mayoor Rao Labels: features Attachments: Jira_request_register_query_as_view.docx We currently have the capability of adding csv, json, parquet, etc. files as tables through Beeline using the Datasource API. We need a mechanism to register a complex query as a table through the JDBC interface. The query definition could be composed using table names which are themselves registered as Spark tables using the Datasource API. The query definition should be persisted, and there should be an option to re-register it when the Thriftserver is restarted. The SQL command should be able to take either a filename containing the JSON content or the JSON content directly. There should be an option to save the output of the queries and register the output as a table. Advantages:
• Create ad hoc join statements across different data sources using Spark from an external BI interface, with no persistence of pre-aggregated data needed.
• No dependency on writing programs to generate ad hoc analytics.
• Enable business users to model the data across diverse data sources in real time without any programming.
• Enable persistence of the query output through the JDBC interface, with no extra programming required.
SQL syntax for registering a set of queries or files as a table - REGISTERSQLJOB USING FILE/JSON FILENAME/JSONContent -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7025: --- Target Version/s: 1.5.0 (was: 1.4.0) Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD
2. Hadoop MapReduce InputFormat
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
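A rough Scala sketch of the shape this proposal seems to describe follows; every name here is hypothetical (none of these traits exist in Spark):
{code}
// Hypothetical sketch of the proposed API, inferred from the description above.
trait InputPartition extends Serializable

trait RecordReader[T] extends java.io.Closeable {
  def next(): Boolean // advance to the next record; false at end of input
  def get(): T        // the current record
}

trait InputSource[T] extends Serializable {
  def getPartitions(): Array[InputPartition]                   // data partitioning
  def createReader(partition: InputPartition): RecordReader[T] // per-partition read
}
{code}
Unlike Hadoop's InputFormat, the record type is a single T rather than a key/value pair, matching the "no explicit key/value separation" point above.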
[jira] [Commented] (SPARK-7521) Allow all required release credentials to be specified with env vars
[ https://issues.apache.org/jira/browse/SPARK-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590444#comment-14590444 ] Josh Rosen commented on SPARK-7521: --- Has this been done now that we're publishing nightly snapshots? Allow all required release credentials to be specified with env vars Key: SPARK-7521 URL: https://issues.apache.org/jira/browse/SPARK-7521 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell When creating releases the following credentials are needed: 1. ASF private key, to post artifacts on people.apache. 2. ASF username and password, to publish to maven and push tags to github. 3. GPG private key and key password, to sign releases. Right now the assumption is that these are made available in the build environment through env vars, pre-installed GPG and private keys, etc. This makes it difficult for us to automate the build, such as allowing the full build+publish to occur on any Jenkins machine. One way to fix this is to make sure all of these can be specified as env vars which can then be securely threaded through to the Jenkins builder. The script itself would then, e.g., create temporary GPG keys and an RSA private key for each build, using these env vars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590494#comment-14590494 ] Cheng Lian commented on SPARK-8406: --- Yeah, just updated the JIRA description. ORC may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond. Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other tasks that finished within the same job. This causes a race condition. In most cases, it only produces nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one output file gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}
And here is another Spark shell snippet for reproducing the overwriting:
{code}
sqlContext.range(0, 1).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}
The expected answer is {{1}}, but you may see a number like {{9960}} due to overwriting. The actual number varies across runs and nodes. Notice that the newly added ORC data source is less likely to hit this issue because it uses the task ID and {{System.currentTimeMillis()}} to generate the output file name. Thus, the ORC data source may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
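For illustration, a one-line Scala sketch of the ORC-style naming scheme the comment above describes (not the actual Spark code): a collision requires two tasks with the same task ID to write to the same location within the same millisecond.
{code}
// Illustrative sketch: part number plus a millisecond timestamp in the name.
def orcStyleFileName(taskId: Int, extension: String): String =
  f"part-$taskId%05d-${System.currentTimeMillis()}$extension"

// e.g. orcStyleFileName(3, ".orc") => "part-00003-1434524760123.orc"
{code}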
[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8406: -- Description: To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other tasks that finished within the same job. This causes a race condition. In most cases, it only produces nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one output file gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}
And here is another Spark shell snippet for reproducing the overwriting:
{code}
sqlContext.range(0, 1).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}
The expected answer is {{1}}, but you may see a number like {{9960}} due to overwriting. The actual number varies across runs and nodes. Notice that the newly added ORC data source is less likely to hit this issue because it uses the task ID and {{System.currentTimeMillis()}} to generate the output file name. Thus, the ORC data source may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond.

was: To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other tasks that finished within the same job. This causes a race condition. In most cases, it only produces nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one output file gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06
{noformat}
[jira] [Commented] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590290#comment-14590290 ] Don Drake commented on SPARK-8365: -- Is there a workaround that you are aware of? pyspark does not retain --packages or --jars passed on the command line as of 1.4.0 --- Key: SPARK-8365 URL: https://issues.apache.org/jira/browse/SPARK-8365 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Don Drake Priority: Blocker I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv I pass the following on the command-line to my spark-submit: --packages com.databricks:spark-csv_2.10:1.0.3 This worked fine on 1.3.1, but not in 1.4. I was able to replicate it with the following pyspark:
{code}
a = {'a': 1.0, 'b': 'asdf'}
rdd = sc.parallelize([a])
df = sqlContext.createDataFrame(rdd)
df.save("/tmp/d.csv", "com.databricks.spark.csv")
{code}
Even using the new df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same error. I see it was added in the web UI:
file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar Added By User
file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar Added By User
http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar Added By User
http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar Added By User
Thoughts? *I also attempted using the Scala spark-shell to load a csv using the same package and it worked just fine, so this seems specific to pyspark.* -Don Gory details:
{code}
$ pyspark --packages com.databricks:spark-csv_2.10:1.0.3
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /Users/drake/.ivy2/cache
The jars for the packages stored in: /Users/drake/.ivy2/jars
:: loading settings :: url = jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.0.3 in central
	found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 590ms :: artifacts dl 17ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.0.3 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/15ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from SCDynamicStore
15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on interface en0)
15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drake); users with modify permissions: Set(drake)
15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
15/06/13 11:06:10 INFO Remoting: Starting remoting
15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.222:56870]
15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on port
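For comparison, the Scala path the reporter says works with the same --packages flag might look like this (a sketch; the DataFrame contents are made up):
{code}
// bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
val df = sqlContext.createDataFrame(Seq(("asdf", 1.0))).toDF("b", "a")
df.write.format("com.databricks.spark.csv").save("/tmp/d.csv")
{code}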
[jira] [Updated] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6208: --- Target Version/s: (was: 1.4.0) executor-memory does not work when using local cluster -- Key: SPARK-6208 URL: https://issues.apache.org/jira/browse/SPARK-6208 Project: Spark Issue Type: New Feature Components: Spark Submit Reporter: Yin Huai Priority: Minor It seems the executor memory set with a local cluster is not applied correctly (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). Also, totalExecutorCores seems to have the same issue (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7019) Build docs on doc changes
[ https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7019: --- Target Version/s: 1.5.0 (was: 1.4.0) Build docs on doc changes - Key: SPARK-7019 URL: https://issues.apache.org/jira/browse/SPARK-7019 Project: Spark Issue Type: New Feature Components: Build Reporter: Brennon York Currently when a pull request changes the {{docs/}} directory, the docs aren't actually built. When a PR is submitted the {{git}} history should be checked to see if any doc changes were made and, if so, properly build the docs and report any issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7018) Refactor dev/run-tests-jenkins into Python
[ https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7018: --- Target Version/s: 1.5.0 (was: 1.4.0) Refactor dev/run-tests-jenkins into Python -- Key: SPARK-7018 URL: https://issues.apache.org/jira/browse/SPARK-7018 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Brennon York This issue is to specifically track the progress of the {{dev/run-tests-jenkins}} script into Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7560) Make flaky tests easier to debug
[ https://issues.apache.org/jira/browse/SPARK-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7560. --- Resolution: Fixed Make flaky tests easier to debug Key: SPARK-7560 URL: https://issues.apache.org/jira/browse/SPARK-7560 Project: Spark Issue Type: New Feature Components: Project Infra, Tests Reporter: Patrick Wendell Right now it's really hard for people to even get the logs from a flaky test. Once you get the logs, it's very difficult to figure out which logs are associated with which tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6557) Only set bold text on PR github test output for problems
[ https://issues.apache.org/jira/browse/SPARK-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590454#comment-14590454 ] Josh Rosen commented on SPARK-6557: --- I'm going to mark this as being blocked by the refactoring of {{dev/run-tests-jenkins}} into Python. Only set bold text on PR github test output for problems Key: SPARK-6557 URL: https://issues.apache.org/jira/browse/SPARK-6557 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Brennon York Priority: Trivial Labels: starter Minor nit, but right now we highlight (i.e. place bold text on) various comments from the PR tests when the PRs are submitted. For example: we currently highlight a PR that merges successfully, and also highlight the fact that the PR fails Spark tests. I propose that we **only highlight (bold) text when there is a problem with a PR**. For instance, we should not bold that the patch merges cleanly, only bold when it **does not**. The entire point is to make it easier for the committers and developers to quickly glance over PR test output and understand what, if anything, they need to dive into. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted
[ https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590215#comment-14590215 ] Steve Loughran commented on SPARK-7889: --- Is this JIRA about (a) the status on the listing of complete/incomplete applications being wrong in some way, or (b) the actual job view (history/some-app-id) being stale when a job completes? (b) is consistent with what I observed in SPARK-8275. Looking at your patch and comparing it with my proposal, I prefer mine. All I'm proposing is invalidating the cache entry for work in progress, so that it is retrieved again. Thinking about it some more, we can go one better: rely on the {{ApplicationHistoryInfo.lastUpdated}} field to tell us when the UI was last updated. If we cache the update time with the UI, then on any GET of an app UI we can check whether the previous UI was incomplete and whether the lastUpdated time has changed; if so, that triggers a refresh. With this approach the entry you see will always be the one most recently published to the history store (of any implementation) and picked up by the history provider in its getListing()/background refresh operation. Jobs progress of apps on complete page of HistoryServer shows uncompleted - Key: SPARK-7889 URL: https://issues.apache.org/jira/browse/SPARK-7889 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor When running a SparkPi with 2000 tasks, clicking into the app on the incomplete page, the job progress shows 400/2000. After the app is completed, the app moves from the incomplete page to the complete page, and now clicking into the app, the job progress still shows 400/2000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
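A compact Scala sketch of the refresh rule proposed in the comment above; the types and field names here are hypothetical, not the actual HistoryServer code:
{code}
// Cache the UI together with the lastUpdated value seen when it was loaded.
case class CachedUI(ui: AnyRef, completed: Boolean, lastUpdated: Long)

// On a GET: reload only if the cached app was incomplete and the listing
// now reports a newer lastUpdated timestamp.
def needsRefresh(cached: CachedUI, listingLastUpdated: Long): Boolean =
  !cached.completed && listingLastUpdated > cached.lastUpdated
{code}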
[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8406: -- Description: To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other tasks that finished within the same job. This causes a race condition. In most cases, it only produces nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one output file gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup      352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet
{noformat}
And here is another Spark shell snippet for reproducing the overwriting:
{code}
sqlContext.range(0, 1).repartition(500).write.mode("overwrite").parquet("foo")
sqlContext.read.parquet("foo").count()
{code}
Notice that the newly added ORC data source doesn't suffer from this issue because it uses both the part number and {{System.currentTimeMillis()}} to generate the output file name.

was: To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, it was moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other tasks that finished within the same job. This causes a race condition. In most cases, it only produces nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one output file gets overwritten by the other. The data loss situation is not quite easy to reproduce. But the following Spark shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
16 can be replaced with any integer greater than the default parallelism on your machine (usually the number of cores; on my machine it's 8).
{noformat}
-rw-r--r--   3 lian supergroup        0 2015-06-17 00:06 /user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet
-rw-r--r--   3 lian supergroup      353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet
{noformat}
[jira] [Updated] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications
[ https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4605: -- Component/s: (was: Project Infra) Proposed Contribution: Spark Kernel to enable interactive Spark applications Key: SPARK-4605 URL: https://issues.apache.org/jira/browse/SPARK-4605 Project: Spark Issue Type: New Feature Reporter: Chip Senkbeil Attachments: Kernel Architecture Widescreen.pdf, Kernel Architecture.pdf Project available on Github: https://github.com/ibm-et/spark-kernel This architecture describes the running kernel code that was demonstrated at StrataConf in Barcelona, Spain. It enables applications to interact with a Spark cluster using Scala in several ways:
* Defining and running core Spark Tasks
* Collecting results from a cluster without needing to write to an external data store
** Ability to stream results using a well-defined protocol
* Arbitrary Scala code definition and execution (without submitting heavy-weight jars)
Applications can be hosted and managed separately from the Spark cluster, using the kernel as a proxy to communicate requests. The Spark Kernel implements the server side of the IPython Kernel protocol, the rising “de-facto” protocol for language (Python, Haskell, etc.) execution, and it inherits a suite of industry-adopted clients such as the IPython Notebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
Josh Rosen created SPARK-8415: - Summary: Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5178) Integrate Python unit tests into Jenkins
[ https://issues.apache.org/jira/browse/SPARK-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590442#comment-14590442 ] Josh Rosen commented on SPARK-5178: --- This is made slightly more complicated by the fact that the PRB tests three Python versions, so getting disambiguated test names might be tricky. Integrate Python unit tests into Jenkins Key: SPARK-5178 URL: https://issues.apache.org/jira/browse/SPARK-5178 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor From [~joshrosen]: {quote} The Test Result pages for Jenkins builds shows some nice statistics for the test run, including individual test times: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/ Currently this only covers the Java / Scala tests, but we might be able to integrate the PySpark tests here, too (I think it's just a matter of getting the Python test runner to generate the correct test result XML output). {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf
[ https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590513#comment-14590513 ] Benjamin Fradet commented on SPARK-8356: Ok, I'll make sure Udf disappears. Should I open another JIRA, or can I add it to the PR for this one? Reconcile callUDF and callUdf - Key: SPARK-8356 URL: https://issues.apache.org/jira/browse/SPARK-8356 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Labels: starter Right now we have two functions {{callUDF}} and {{callUdf}}. I think the former is used for calling Java functions (and the documentation is wrong) and the latter is for calling functions by name. Either way this is confusing and we should unify or pick different names. Also, let's make sure the docs are right. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf
[ https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590505#comment-14590505 ] Michael Armbrust commented on SPARK-8356: - Sure (and the convention in Spark would be to use UDF), but those are internal APIs, so I'm less concerned there. Reconcile callUDF and callUdf - Key: SPARK-8356 URL: https://issues.apache.org/jira/browse/SPARK-8356 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Labels: starter Right now we have two functions {{callUDF}} and {{callUdf}}. I think the former is used for calling Java functions (and the documentation is wrong) and the latter is for calling functions by name. Either way this is confusing and we should unify or pick different names. Also, let's make sure the docs are right. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7160: Assignee: Ray Ortigas Support converting DataFrames to typed RDDs. Key: SPARK-7160 URL: https://issues.apache.org/jira/browse/SPARK-7160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.1 Reporter: Ray Ortigas Assignee: Ray Ortigas As a Spark user still working with RDDs, I'd like the ability to convert a DataFrame to a typed RDD. For example, if I've converted RDDs to DataFrames so that I could save them as Parquet or CSV files, I would like to rebuild the RDD from those files automatically rather than writing the row-to-type conversion myself.
{code}
val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
val df0 = rdd0.toDF()
df0.save("foods.parquet")
val df1 = sqlContext.load("foods.parquet")
val rdd1 = df1.toTypedRDD[Food]()
// rdd0 and rdd1 should have the same elements
{code}
I originally submitted a smaller PR for spark-csv https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested that converting a DataFrame to a typed RDD wasn't something specific to spark-csv. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7160) Support converting DataFrames to typed RDDs.
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7160: Priority: Critical (was: Major) Target Version/s: 1.5.0 Shepherd: Michael Armbrust Support converting DataFrames to typed RDDs. Key: SPARK-7160 URL: https://issues.apache.org/jira/browse/SPARK-7160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.1 Reporter: Ray Ortigas Assignee: Ray Ortigas Priority: Critical As a Spark user still working with RDDs, I'd like the ability to convert a DataFrame to a typed RDD. For example, if I've converted RDDs to DataFrames so that I could save them as Parquet or CSV files, I would like to rebuild the RDD from those files automatically rather than writing the row-to-type conversion myself.
{code}
val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), Food("cherry", 3)))
val df0 = rdd0.toDF()
df0.save("foods.parquet")
val df1 = sqlContext.load("foods.parquet")
val rdd1 = df1.toTypedRDD[Food]()
// rdd0 and rdd1 should have the same elements
{code}
I originally submitted a smaller PR for spark-csv https://github.com/databricks/spark-csv/pull/52, but Reynold Xin suggested that converting a DataFrame to a typed RDD wasn't something specific to spark-csv. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
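For context, the hand-written row-to-type conversion that {{toTypedRDD[Food]()}} is meant to replace might look like this (a sketch assuming the two-column schema from the example above):
{code}
case class Food(name: String, count: Int) // assumed field types

// Manual conversion from DataFrame rows back to the case class:
val rdd1 = df1.rdd.map(row => Food(row.getString(0), row.getInt(1)))
{code}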
[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`
[ https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590596#comment-14590596 ] Reynold Xin commented on SPARK-3854: I don't think we have this yet. Scala style: require spaces before `{` -- Key: SPARK-3854 URL: https://issues.apache.org/jira/browse/SPARK-3854 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen We should require spaces before opening curly braces. This isn't in the style guide, but it probably should be:
{code}
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
  println("Wow!")
}
{code}
See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an example in the wild. {{git grep "){"}} shows only a few occurrences of this style. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8144) For PySpark SQL, automatically convert values provided in readwriter options to string
[ https://issues.apache.org/jira/browse/SPARK-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8144: Summary: For PySpark SQL, automatically convert values provided in readwriter options to string (was: PySpark SQL readwriter options() does not work) For PySpark SQL, automatically convert values provided in readwriter options to string -- Key: SPARK-8144 URL: https://issues.apache.org/jira/browse/SPARK-8144 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Because of typos in lines 81 and 240 of: [https://github.com/apache/spark/blob/16fc49617e1dfcbe9122b224f7f63b7bfddb36ce/python/pyspark/sql/readwriter.py] (search for option().) CC: [~yhuai] [~davies] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7626) Actions on DataFrame created from HIVE table with newly added column throw NPE
[ https://issues.apache.org/jira/browse/SPARK-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7626: Description: We recently added a new column page_context to a Hive table named clicks, partitioned by data_date. This leads to an NPE being thrown on DataFrames created on older partitions without this column populated. For example:
{code}
val hc = new HiveContext(sc)
val clk = hc.sql("select * from clicks where data_date=20150302")
clk.show()
{code}
throws the following error msg:
{code}
java.lang.RuntimeException: cannot find field page_context from [0:log_format_number, .]
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
 at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
{code}

was: We recently added a new column page_context to a Hive table named clicks, partitioned by data_date. This leads to an NPE being thrown on DataFrames created on older partitions without this column populated. For example:
val hc = new HiveContext(sc)
val clk = hc.sql("select * from clicks where data_date=20150302")
clk.show()
throws the following error msg:
java.lang.RuntimeException: cannot find field page_context from [0:log_format_number, .]
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
 at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at
[jira] [Updated] (SPARK-7626) Actions on DataFrame created from HIVE table with newly added column throw NPE
[ https://issues.apache.org/jira/browse/SPARK-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7626: Component/s: (was: Spark Core) SQL Actions on DataFrame created from HIVE table with newly added column throw NPE --- Key: SPARK-7626 URL: https://issues.apache.org/jira/browse/SPARK-7626 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Zhiyang Guo We recently added a new column page_context to a hive table named clicks, partitioned by data_date. This leads to an NPE being thrown on DataFrames created on older partitions without this column populated. For example: {code} val hc = new HiveContext(sc) val clk = hc.sql("select * from clicks where data_date=20150302") clk.show() {code} throws the following error msg: {code} java.lang.RuntimeException: cannot find field page_context from [0:log_format_number, .] at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415) at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:173) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:278) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:277) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:277) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:194) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$4$$anonfun$9.apply(TableReader.scala:188) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7067) Can't resolve nested column in ORDER BY
[ https://issues.apache.org/jira/browse/SPARK-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7067. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5659 [https://github.com/apache/spark/pull/5659] Can't resolve nested column in ORDER BY --- Key: SPARK-7067 URL: https://issues.apache.org/jira/browse/SPARK-7067 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Fix For: 1.5.0 In order to avoid breaking existing HiveQL queries, the current way we resolve a column in ORDER BY is: first resolve based on what comes from the select clause, and then fall back on its child only when this fails. However, this case will fail: {code} test("orderby queries") { jsonRDD(sparkContext.makeRDD("""{"a": {"b": [{"c": 1}]}, "b": [{"d": 1}]}""" :: Nil)).registerTempTable("t") sql("SELECT a.b FROM t ORDER BY b[0].d").queryExecution.analyzed } {code} Since Hive doesn't support resolving an ORDER BY attribute that doesn't exist in the select clause, this problem is Spark SQL only. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7026) LeftSemiJoin can not work when it has both equal condition and not equal condition.
[ https://issues.apache.org/jira/browse/SPARK-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7026: Target Version/s: 1.5.0 LeftSemiJoin can not work when it has both equal condition and not equal condition. - Key: SPARK-7026 URL: https://issues.apache.org/jira/browse/SPARK-7026 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Zhongshuai Pei Assignee: Adrian Wang Run SQL like this: {panel} select * from web_sales ws1 left semi join web_sales ws2 on ws1.ws_order_number = ws2.ws_order_number and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk {panel} then you get an exception: {panel} Couldn't find ws_warehouse_sk#287 in {ws_sold_date_sk#237,ws_sold_time_sk#238,ws_ship_date_sk#239,ws_item_sk#240,ws_bill_customer_sk#241,ws_bill_cdemo_sk#242,ws_bill_hdemo_sk#243,ws_bill_addr_sk#244,ws_ship_customer_sk#245,ws_ship_cdemo_sk#246,ws_ship_hdemo_sk#247,ws_ship_addr_sk#248,ws_web_page_sk#249,ws_web_site_sk#250,ws_ship_mode_sk#251,ws_warehouse_sk#252,ws_promo_sk#253,ws_order_number#254,ws_quantity#255,ws_wholesale_cost#256,ws_list_price#257,ws_sales_price#258,ws_ext_discount_amt#259,ws_ext_sales_price#260,ws_ext_wholesale_cost#261,ws_ext_list_price#262,ws_ext_tax#263,ws_coupon_amt#264,ws_ext_ship_cost#265,ws_net_paid#266,ws_net_paid_inc_tax#267,ws_net_paid_inc_ship#268,ws_net_paid_inc_ship_tax#269,ws_net_profit#270,ws_sold_date#236} at scala.sys.package$.error(package.scala:27) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8390) Update DirectKafkaWordCount examples to show how offset ranges can be used
[ https://issues.apache.org/jira/browse/SPARK-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590720#comment-14590720 ] Cody Koeninger commented on SPARK-8390: --- Did we actually want to update the wordcount examples (might confuse, since you don't need offset ranges for minimal wordcount usage)... or just fix the part of the docs about the offset ranges? The PR is just fixing the docs for now. I'd personally prefer to link to the talk / slides about the direct stream once it's available... not sure how you feel about external links in the doc. Update DirectKafkaWordCount examples to show how offset ranges can be used -- Key: SPARK-8390 URL: https://issues.apache.org/jira/browse/SPARK-8390 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8368: Priority: Blocker (was: Major) ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: CentOS 6.5, Java 1.7.0_67, Scala 2.10.4. Built the project on Windows 7 and ran it in Spark standalone cluster (or local) mode on CentOS 6.x. Reporter: CHEN Zhiwei Priority: Blocker After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the following exception: ==begin exception {quote} Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplified the code that causes this issue, as follows: ==begin code== {noformat} object Model extends Serializable { def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName("Mode Example") val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) // get data by hive sql val rows = hive.sql(sql) val data = rows.map(r => { val arr = r.toSeq.toArray val label = 1.0 def fmap = (input: Any) => 1.0 val feature = arr.map(_ => 1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code runs fine in spark-shell, but fails when submitted to a Spark cluster (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), and no exception was encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590731#comment-14590731 ] Yin Huai commented on SPARK-8368: - Right now, it looks like a problem caused by Spark SQL's isolated class loader. ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: CentOS 6.5, Java 1.7.0_67, Scala 2.10.4. Built the project on Windows 7 and ran it in Spark standalone cluster (or local) mode on CentOS 6.x. Reporter: CHEN Zhiwei Priority: Blocker After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the following exception: ==begin exception {quote} Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplified the code that causes this issue, as follows: ==begin code== {noformat} object Model extends Serializable { def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName("Mode Example") val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) // get data by hive sql val rows = hive.sql(sql) val data = rows.map(r => { val arr = r.toSeq.toArray val label = 1.0 def fmap = (input: Any) => 1.0 val feature = arr.map(_ => 1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code runs fine in spark-shell, but fails when submitted to a Spark cluster (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), and no exception was encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8397) Allow custom configuration for TestHive
[ https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8397. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6844 [https://github.com/apache/spark/pull/6844] Allow custom configuration for TestHive --- Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Punya Biswal Priority: Minor Fix For: 1.5.0 We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Girardot updated SPARK-8332: Description: I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it throws NoSuchMethodError {code} java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer {code} I found out that the default fasterxml.jackson.version is 2.4.4. Is there anything wrong, or a conflict with the jackson version? Or is there possibly some project maven dependency containing the wrong version of jackson? was: I complied new spark 1.4.0 versio. But when I run a simple WordCount demo, it throws NoSuchMethodError java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer. I found the default fasterxml.jackson.version is 2.4.4. It's there any wrong or conflict with the jackson version? Or is there possible some project maven dependency contains wrong version jackson? NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer -- Key: SPARK-8332 URL: https://issues.apache.org/jira/browse/SPARK-8332 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: spark 1.4 hadoop 2.3.0-cdh5.0.0 Reporter: Tao Li Priority: Critical Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it throws NoSuchMethodError {code} java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer {code} I found out that the default fasterxml.jackson.version is 2.4.4. Is there anything wrong, or a conflict with the jackson version? Or is there possibly some project maven dependency containing the wrong version of jackson? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-5971: - Target Version/s: 1.5.0 (was: 1.4.0) Add Mesos support to spark-ec2 -- Key: SPARK-5971 URL: https://issues.apache.org/jira/browse/SPARK-5971 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now, spark-ec2 can only launch Spark clusters that use the standalone manager. Adding support for Mesos would be useful mostly for automated performance testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590791#comment-14590791 ] Shivaram Venkataraman commented on SPARK-6218: -- [~nchammas] I just updated the target version to 1.5.0 for this. FWIW I don't have a strong opinion about which argument parsing library we use as long as we can maintain compatibility with Python 2.6 Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
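To illustrate the validation benefit described above, here is a minimal argparse sketch (the flag name and rule are illustrative, not the actual spark-ec2 options): each parameter is tied to a small validation function via {{type=}}, so bad input fails at parse time with a clear message.
{code}
import argparse

def positive_int(value):
    """Reject non-positive counts at parse time instead of deep inside the script."""
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError("%r is not a positive integer" % value)
    return n

parser = argparse.ArgumentParser(description="launch a Spark cluster on EC2")
parser.add_argument("--slaves", type=positive_int, default=1,
                    help="number of slave instances to launch")

args = parser.parse_args(["--slaves", "4"])
print(args.slaves)  # 4
{code}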
[jira] [Commented] (SPARK-5971) Add Mesos support to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590799#comment-14590799 ] Shivaram Venkataraman commented on SPARK-5971: -- Moving this to target 1.5.0 cc [~tnachen] who might be interested in this. Add Mesos support to spark-ec2 -- Key: SPARK-5971 URL: https://issues.apache.org/jira/browse/SPARK-5971 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now, spark-ec2 can only launch Spark clusters that use the standalone manager. Adding support for Mesos would be useful mostly for automated performance testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7814) Turn code generation on by default
[ https://issues.apache.org/jira/browse/SPARK-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590885#comment-14590885 ] Herman van Hovell tot Westerflier commented on SPARK-7814: -- I have built Spark from the latest source using Hadoop 2.3/2.6 (tried them both), using the following command: {noformat} make-distribution.sh -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver {noformat} When executing the following commands: {noformat} val otp = sqlContext.read.parquet("Input/otp.prq") otp.count {noformat} I get the following Janino (Code Generation) error: {noformat} 15/06/17 19:35:51 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job 15/06/17 19:35:51 ERROR GenerateProjection: failed to compile: import org.apache.spark.sql.catalyst.InternalRow; public SpecificProjection generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificProjection(expr); } class SpecificProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProject { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public Object apply(Object r) { return new SpecificRow(expressions, (InternalRow) r); } } final class SpecificRow extends org.apache.spark.sql.BaseMutableRow { private long c0 = -1L; public SpecificRow(org.apache.spark.sql.catalyst.expressions.Expression[] expressions, InternalRow i) { { // column0 nullBits[0] = false; if (!false) { c0 = 0L; } } } public int size() { return 1;} protected boolean[] nullBits = new boolean[1]; public void setNullAt(int i) { nullBits[i] = true; } public boolean isNullAt(int i) { return nullBits[i]; } public Object get(int i) { if (isNullAt(i)) return null; switch (i) { case 0: return c0; } return null; } public void update(int i, Object value) { if (value == null) { setNullAt(i); return; } nullBits[i] = false; switch (i) { case 0: { c0 = (Long)value; return;} } } @Override public long getLong(int i) { if (isNullAt(i)) { return -1L; } switch (i) { case 0: return c0; } throw new IllegalArgumentException("Invalid index: " + i + " in getLong"); } @Override public void setLong(int i, long value) { nullBits[i] = false; switch (i) { case 0: { c0 = value; return; } } throw new IllegalArgumentException("Invalid index: " + i + " in setLong"); } @Override public int hashCode() { int result = 37; result *= 37; result += isNullAt(0) ? 0 : (c0 ^ (c0 >>> 32)); return result; } @Override public boolean equals(Object other) { if (other instanceof SpecificRow) { SpecificRow row = (SpecificRow) other; if (nullBits[0] != row.nullBits[0] || (!nullBits[0] && !(c0 == row.c0))) { return false; } return true; } return super.equals(other); } } org.codehaus.commons.compiler.CompileException: Line 16, Column 33: Object at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:6897) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5331) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5207) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5188) at org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5119) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159) at org.codehaus.janino.UnitCompiler.access$16700(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$31.getParameterTypes2(UnitCompiler.java:8533) at org.codehaus.janino.IClass$IInvocable.getParameterTypes(IClass.java:835) at org.codehaus.janino.IClass$IMethod.getDescriptor2(IClass.java:1063) at org.codehaus.janino.IClass$IInvocable.getDescriptor(IClass.java:849) at org.codehaus.janino.IClass.getIMethods(IClass.java:211) at org.codehaus.janino.IClass.getIMethods(IClass.java:199) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:409) at
[jira] [Created] (SPARK-8417) spark-class has illegal statement
jweinste created SPARK-8417: --- Summary: spark-class has illegal statement Key: SPARK-8417 URL: https://issues.apache.org/jira/browse/SPARK-8417 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0 Reporter: jweinste Priority: Blocker spark-class There is an illegal statement. done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@") Complaint is ./bin/spark-class: line 100: syntax error near unexpected token `<' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8412) java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges
[ https://issues.apache.org/jira/browse/SPARK-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jweinste reopened SPARK-8412: - This is improperly implemented or improperly documented. Either way something needs to be corrected. http://spark.apache.org/docs/latest/streaming-kafka-integration.html You'll have to scan down to http://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_java_2 directKafkaStream.foreachRDD( new Function<JavaPairRDD<String, String>, Void>() { @Override public Void call(JavaPairRDD<String, Integer> rdd) throws IOException { OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges // offsetRanges.length = # of Kafka partitions being consumed ... return null; } } ); java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges -- Key: SPARK-8412 URL: https://issues.apache.org/jira/browse/SPARK-8412 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: jweinste Priority: Critical // Create direct kafka stream with brokers and topics final JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topics); messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() { @Override public Void call(final JavaPairRDD<String, String> rdd) throws Exception { if (rdd instanceof HasOffsetRanges) { //will never happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7017) Refactor dev/run-tests into Python
[ https://issues.apache.org/jira/browse/SPARK-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590826#comment-14590826 ] Apache Spark commented on SPARK-7017: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/6865 Refactor dev/run-tests into Python -- Key: SPARK-7017 URL: https://issues.apache.org/jira/browse/SPARK-7017 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Brennon York Assignee: Brennon York Fix For: 1.5.0 This issue is to specifically track the progress of refactoring the {{dev/run-tests}} script into Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590880#comment-14590880 ] Yu Ishikawa commented on SPARK-8348: Hi [~shivaram], Thank you for letting me know about the other PR to add operations to SparkR. Can I ask you a couple of questions about adding a new operator? The added operations don't include any method to deal with an array or list. I am having trouble with how to deal with an array or list in arguments when calling a Java method. The gist includes the details of the code and error messages. Please check it. https://gist.github.com/yu-iskw/ba249f79ef338ff86967 Anyway, {{filter(df, "age in (19)")}} works without problems. But how do I implement {{%in%}} in SparkR? Add in operator to DataFrame Column --- Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng It is convenient to add an in operator to Column, so we can filter values in a set. {code} df.filter(col("brand").in("dell", "sony")) {code} In R, the operator should be `%in%`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8406: --- Assignee: Apache Spark (was: Cheng Lian) Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Blocker To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this is moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other finished tasks within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks may choose the same part number, and thus one of them gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers: {code} sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo") {code} 16 can be replaced with any integer that is greater than the default parallelism on your machine (usually it means the core number; on my machine it's 8). {noformat} -rw-r--r-- 3 lian supergroup 0 2015-06-17 00:06 /user/lian/foo/_SUCCESS -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet -rw-r--r-- 3 lian supergroup 352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet {noformat} And here is another Spark shell snippet for reproducing overwriting: {code} sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo") sqlContext.read.parquet("foo").count() {code} The expected answer should be {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies for different runs and different nodes. Notice that the newly added ORC data source is less likely to hit this issue because it uses the task ID and {{System.currentTimeMillis()}} to generate the output file name. Thus, the ORC data source may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
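To make the failure mode concrete, here is a toy single-process model of the race described above (pure Python, not Spark code): two concurrently scheduled tasks list the same snapshot of the destination directory, derive the same max part number, and therefore pick the same next ID.
{code}
# Files already written by earlier tasks, following the part-r-<id>.gz.parquet naming.
existing = ["part-r-%05d.gz.parquet" % i for i in range(1, 9)]

def next_part_number(listing):
    """What each task does: scan the directory and take max(id) + 1."""
    ids = [int(name.split("-")[2].split(".")[0]) for name in listing]
    return max(ids) + 1

task_a = next_part_number(existing)  # task A lists the directory...
task_b = next_part_number(existing)  # ...and task B lists it before A's file lands
assert task_a == task_b == 9         # both write part-r-00009; one overwrites the other
{code}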
[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590770#comment-14590770 ] Apache Spark commented on SPARK-8406: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6864 Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this is moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other finished tasks within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks may choose the same part number, and thus one of them gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers: {code} sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo") {code} 16 can be replaced with any integer that is greater than the default parallelism on your machine (usually it means the core number; on my machine it's 8). {noformat} -rw-r--r-- 3 lian supergroup 0 2015-06-17 00:06 /user/lian/foo/_SUCCESS -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet -rw-r--r-- 3 lian supergroup 352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet {noformat} And here is another Spark shell snippet for reproducing overwriting: {code} sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo") sqlContext.read.parquet("foo").count() {code} The expected answer should be {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies for different runs and different nodes. Notice that the newly added ORC data source is less likely to hit this issue because it uses the task ID and {{System.currentTimeMillis()}} to generate the output file name. Thus, the ORC data source may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8406) Race condition when writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8406: --- Assignee: Cheng Lian (was: Apache Spark) Race condition when writing Parquet files - Key: SPARK-8406 URL: https://issues.apache.org/jira/browse/SPARK-8406 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (the <id> in the output file name part-r-<id>.gz.parquet) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this is moved to the task side. Thus, tasks scheduled later may see a wrong max part number generated by files newly written by other finished tasks within the same job. This actually causes a race condition. In most cases, this only causes nonconsecutive IDs in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks may choose the same part number, and thus one of them gets overwritten by the other. The following Spark shell snippet can reproduce nonconsecutive part numbers: {code} sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo") {code} 16 can be replaced with any integer that is greater than the default parallelism on your machine (usually it means the core number; on my machine it's 8). {noformat} -rw-r--r-- 3 lian supergroup 0 2015-06-17 00:06 /user/lian/foo/_SUCCESS -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00001.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00002.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00003.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00004.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00005.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00006.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00007.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00008.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00017.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00018.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00019.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00020.gz.parquet -rw-r--r-- 3 lian supergroup 352 2015-06-17 00:06 /user/lian/foo/part-r-00021.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00022.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00023.gz.parquet -rw-r--r-- 3 lian supergroup 353 2015-06-17 00:06 /user/lian/foo/part-r-00024.gz.parquet {noformat} And here is another Spark shell snippet for reproducing overwriting: {code} sqlContext.range(0, 10000).repartition(500).write.mode("overwrite").parquet("foo") sqlContext.read.parquet("foo").count() {code} The expected answer should be {{10000}}, but you may see a number like {{9960}} due to overwriting. The actual number varies for different runs and different nodes. Notice that the newly added ORC data source is less likely to hit this issue because it uses the task ID and {{System.currentTimeMillis()}} to generate the output file name. Thus, the ORC data source may hit this issue only when two tasks with the same task ID (which means they are in two concurrent jobs) are writing to the same location within the same millisecond. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6218: - Target Version/s: 1.5.0 (was: 1.4.0) Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590862#comment-14590862 ] Yu Ishikawa commented on SPARK-6813: Before discussing the details of coding style, let me confirm: should we consider the license of a lint tool or not? And in my opinion, merging the lint tool into the master branch is more important than making a perfect style guide. So first of all, we should create an almost-perfect style guide and run checks against it on the official Jenkins. If we think of a new idea for the style guide, we can add it later. SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style checking tool. https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of the rules based on the conversation in Spark: 1. Line Length: maximum 100 characters 2. no limit on function name (API should be similar as in other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8418) Add single- and multi-value support to ML Transformers
Joseph K. Bradley created SPARK-8418: Summary: Add single- and multi-value support to ML Transformers Key: SPARK-8418 URL: https://issues.apache.org/jira/browse/SPARK-8418 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley It would be convenient if all feature transformers supported transforming columns of single values and multiple values, specifically: * one column with one value (e.g., type {{Double}}) * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) We could go as far as supporting multiple columns, but that may not be necessary since VectorAssembler could be used to handle that. Estimators under {{ml.feature}} should also support this. This will likely require a short design doc to describe: * how input and output columns will be specified * schema validation * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8419) Statistics.colStats could avoid an extra count()
Joseph K. Bradley created SPARK-8419: Summary: Statistics.colStats could avoid an extra count() Key: SPARK-8419 URL: https://issues.apache.org/jira/browse/SPARK-8419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Statistics.colStats goes through RowMatrix to compute the stats. But RowMatrix.computeColumnSummaryStatistics does an extra count() which could be avoided. Not going through RowMatrix would skip this extra pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
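To illustrate the single-pass idea behind SPARK-8419 above, here is a toy PySpark sketch (not the MLlib implementation; it computes the population variance of one column for brevity): the count is accumulated inside the same aggregate job as the other statistics, so no separate count() pass over the data is needed.
{code}
from pyspark import SparkContext

sc = SparkContext("local", "colstats-sketch")
rdd = sc.parallelize([1.0, 2.0, 3.0, 4.0])

zero = (0, 0.0, 0.0)  # (count, sum, sum of squares)

def seq_op(acc, x):
    n, s, sq = acc
    return (n + 1, s + x, sq + x * x)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

# One job computes everything, including the count.
count, total, total_sq = rdd.aggregate(zero, seq_op, comb_op)
mean = total / count
variance = total_sq / count - mean * mean
print(count, mean, variance)  # 4 2.5 1.25
{code}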
[jira] [Updated] (SPARK-7814) Turn code generation on by default
[ https://issues.apache.org/jira/browse/SPARK-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7814: --- Description: Turn code gen on, find a lot of bugs, and see what happens. Turn code generation on by default -- Key: SPARK-7814 URL: https://issues.apache.org/jira/browse/SPARK-7814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.5.0 Turn code gen on, find a lot of bugs, and see what happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!
[ https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590922#comment-14590922 ] Joseph K. Bradley commented on SPARK-8335: -- I've discussed this with [~mengxr] and we decided to leave it alone. I agree it's annoying, but we figured that people will use the Pipelines API in the future (where this is not an issue) and that not breaking people's code would be best. Does that sound tolerable? DecisionTreeModel.predict() return type not convenient! --- Key: SPARK-8335 URL: https://issues.apache.org/jira/browse/SPARK-8335 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Sebastian Walz Priority: Minor Labels: easyfix, machine_learning Original Estimate: 10m Remaining Estimate: 10m org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method: def predict(features: JavaRDD[Vector]): JavaRDD[Double] The problem here is the generic type of the return type JavaRDD[Double], because it's a Scala Double and I would expect a java.lang.Double (to be consistent with, e.g., org.apache.spark.mllib.classification.ClassificationModel). I wanted to extend the DecisionTreeModel and use it only for binary classification, and wanted to implement the trait org.apache.spark.mllib.classification.ClassificationModel. But it's not possible, because the ClassificationModel already defines the predict method, with a return type of JavaRDD[java.lang.Double]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8010) Implicitly promote Numeric type to String type in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8010. - Resolution: Fixed Fix Version/s: (was: 1.3.1) 1.5.0 Issue resolved by pull request 6551 [https://github.com/apache/spark/pull/6551] Implicitly promote Numeric type to String type in HiveTypeCoercion --- Key: SPARK-8010 URL: https://issues.apache.org/jira/browse/SPARK-8010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Li Sheng Fix For: 1.5.0 Original Estimate: 48h Remaining Estimate: 48h 1. A query like `select coalesce(null, 1, '1') from dual` will cause an exception: java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType 2. A query like `select case when true then 1 else '1' end from dual` will cause an exception: java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType I checked the code; the main cause is that HiveTypeCoercion doesn't do an implicit conversion when there is an IntegerType and a StringType. Numeric types can be promoted to string type in cases that would otherwise throw exceptions, since Hive always does this. It needs to be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8372. Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Assignee: Carson Wang Target Version/s: 1.4.1, 1.5.0 History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Carson Wang Priority: Minor Fix For: 1.4.1, 1.5.0 Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application, like <App ID>.inprogress. This app info will never disappear, even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8373. Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Assignee: Shixiong Zhu Target Version/s: 1.4.1, 1.5.0 When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.4.1, 1.5.0 The issue is that sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
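The reduce-versus-fold difference behind the fix above, in a minimal PySpark sketch (a local SparkContext is created here just to make the snippet self-contained):
{code}
import operator
from pyspark import SparkContext

sc = SparkContext("local", "fold-vs-reduce")
empty = sc.parallelize([], 4)  # an RDD with partitions but no elements

# reduce() has nothing to return for an empty RDD:
# empty.reduce(operator.add)        # raises ValueError: Can not reduce() empty RDD

# fold() seeds every partition with a zero value, so it degrades gracefully:
print(empty.fold(0, operator.add))  # 0
{code}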
[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8373: - Affects Version/s: (was: 1.4.0) 1.2.0 When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.4.1, 1.5.0 The issue is that sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8373: - Affects Version/s: 1.4.0 When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Shixiong Zhu Priority: Minor The issue is that sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org