[jira] [Updated] (SPARK-10946) JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs
[ https://issues.apache.org/jira/browse/SPARK-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Priyadarshini updated SPARK-10946: -- Summary: JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs (was: JDBC - Use Statement.execute instead of PreparedStatement.execute for DDLs) > JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate > for DDLs > -- > > Key: SPARK-10946 > URL: https://issues.apache.org/jira/browse/SPARK-10946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.1 >Reporter: Pallavi Priyadarshini >Priority: Minor > > Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under > the covers. The current code in DataFrameWriter and JDBCUtils uses > PreparedStatement.executeUpdate to issue the DDLs to the databases. This causes the > DDLs to fail against a couple of databases that do not support preparing DDL statements. > Can we use Statement.executeUpdate instead of > PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there > shouldn't be a performance impact. > I can submit a pull request if no one has objections. > Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10946) JDBC - Use Statement.execute instead of PreparedStatement.execute for DDLs
Pallavi Priyadarshini created SPARK-10946: - Summary: JDBC - Use Statement.execute instead of PreparedStatement.execute for DDLs Key: SPARK-10946 URL: https://issues.apache.org/jira/browse/SPARK-10946 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.4.1, 1.4.0 Reporter: Pallavi Priyadarshini Priority: Minor Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under the covers. The current code in DataFrameWriter and JDBCUtils uses PreparedStatement.executeUpdate to issue the DDLs to the databases. This causes the DDLs to fail against a couple of databases that do not support preparing DDL statements. Can we use Statement.executeUpdate instead of PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there shouldn't be a performance impact. I can submit a pull request if no one has objections. Thanks.
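The change being proposed can be sketched as follows (a minimal illustration using the plain JDBC API; the method and variable names here are mine, not Spark's actual JdbcUtils code):

```scala
import java.sql.Connection

// Issue DDL through a plain Statement rather than a PreparedStatement.
// Some databases reject preparing DDL, while Statement.executeUpdate is
// universally supported for it. DDL runs once per table, so losing
// prepared-statement reuse has no performance cost.
def executeDdl(conn: Connection, ddl: String): Unit = {
  val stmt = conn.createStatement()
  try {
    stmt.executeUpdate(ddl) // e.g. "CREATE TABLE t (id INT)" or "DROP TABLE t"
  } finally {
    stmt.close()
  }
}
```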
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944569#comment-14944569 ] Rekha Joshi commented on SPARK-10942: - Great, thanks [~pnpritchard]. In any case, I will keep an eye open in case I see this happening under specific conditions. Thanks! > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard >Priority: Minor > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to work around it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. > {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code}
[jira] [Created] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)
Khaled Ammar created SPARK-10945: Summary: GraphX computes Pagerank with NaN (with some datasets) Key: SPARK-10945 URL: https://issues.apache.org/jira/browse/SPARK-10945 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Environment: Linux Reporter: Khaled Ammar Hi, I run GraphX on a medium-sized standalone Spark 1.3.0 installation. PageRank typically works fine, except with one dataset (Twitter: http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that is commonly used in research papers. I found that many vertices have NaN values. This is true even if the algorithm runs for only one iteration. Thanks, -Khaled
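For reference, one way to reproduce the check described above (a sketch, assuming the standard GraphX API; the edge-list path for the Twitter dataset is hypothetical):

```scala
import org.apache.spark.graphx.GraphLoader

// Run a single static PageRank iteration and count NaN ranks; a
// non-zero count reproduces the reported behaviour.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/twitter-2010-edges.txt")
val ranks = graph.staticPageRank(1).vertices
val nanCount = ranks.filter { case (_, rank) => rank.isNaN }.count()
println(s"vertices with NaN rank: $nanCount")
```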
[jira] [Comment Edited] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
[ https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539 ] Pranas Baliuka edited comment on SPARK-10944 at 10/6/15 5:42 AM: - If one wants to deploy Spark without Hadoop, it should be possible. Currently even the path names and jar names conflict with each other: {quote} spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar {quote} Long term solution: remove mentions of Hadoop from the path and jar names. was (Author: pranas): If one wants to deploy Spark without Hadoop, it should be possible. Currently even the path names and jar names conflict with each other: {quote} spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar {quote} > org/slf4j/Logger is not provided in > spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > --- > > Key: SPARK-10944 > URL: https://issues.apache.org/jira/browse/SPARK-10944 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.5.1 > Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop >Reporter: Pranas Baliuka >Priority: Blocker > Labels: easyfix, patch > Original Estimate: 2h > Remaining Estimate: 2h > > An attempt to run a Spark cluster on a Mac OS machine fails > Invocation: > {code} > # cd $SPARK_HOME > Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh > {code} > Output: > {code} > starting org.apache.spark.deploy.master.Master, logging to > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > failed to launch org.apache.spark.deploy.master.Master: > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 
7 more > full log in > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > {code} > Log: > {code} > # Options read when launching programs locally with > # ./bin/run-example or ./bin/spark-submit > Spark Command: > /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port > 7077 --webui-port 8080 > > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) > at java.lang.Class.privateGetMethodRecursive(Class.java:3048) > at java.lang.Class.getMethod0(Class.java:3018) > at java.lang.Class.getMethod(Class.java:1784) > at > sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) > at > sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) > Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > {code} > Proposed short term fix: > Bundle all required 3rd party libs into the uberjar and/or fix the start-up script > to include the required 3rd party libs. > Long term quality improvement proposal: Introduce integration tests to check the > distribution before releasing.
[jira] [Comment Edited] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
[ https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539 ] Pranas Baliuka edited comment on SPARK-10944 at 10/6/15 5:42 AM: - If one wants to deploy Spark without Hadoop, it should be possible. Currently even the path names and jar names contradict each other: {quote} spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar {quote} Long term solution: remove mentions of Hadoop from the path and jar names. was (Author: pranas): If one wants to deploy Spark without Hadoop, it should be possible. Currently even the path names and jar names conflict with each other: {quote} spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar {quote} Long term solution: remove mentions of Hadoop from the path and jar names. > org/slf4j/Logger is not provided in > spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > --- > > Key: SPARK-10944 > URL: https://issues.apache.org/jira/browse/SPARK-10944 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.5.1 > Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop >Reporter: Pranas Baliuka >Priority: Blocker > Labels: easyfix, patch > Original Estimate: 2h > Remaining Estimate: 2h > > An attempt to run a Spark cluster on a Mac OS machine fails > Invocation: > {code} > # cd $SPARK_HOME > Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh > {code} > Output: > {code} > starting org.apache.spark.deploy.master.Master, logging to > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > failed to launch org.apache.spark.deploy.master.Master: > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 
7 more > full log in > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > {code} > Log: > {code} > # Options read when launching programs locally with > # ./bin/run-example or ./bin/spark-submit > Spark Command: > /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port > 7077 --webui-port 8080 > > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) > at java.lang.Class.privateGetMethodRecursive(Class.java:3048) > at java.lang.Class.getMethod0(Class.java:3018) > at java.lang.Class.getMethod(Class.java:1784) > at > sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) > at > sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) > Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > {code} > Proposed short term fix: > Bundle all required 3rd party libs into the uberjar and/or fix the start-up script > to include the required 3rd party libs. > Long term quality improvement proposal: Introduce integration tests to check the > distribution before releasing.
[jira] [Commented] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
[ https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539 ] Pranas Baliuka commented on SPARK-10944: If one wants to deploy Spark without Hadoop, it should be possible. Currently even the path names and jar names conflict with each other: {quote} spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar {quote} > org/slf4j/Logger is not provided in > spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > --- > > Key: SPARK-10944 > URL: https://issues.apache.org/jira/browse/SPARK-10944 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.5.1 > Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop >Reporter: Pranas Baliuka >Priority: Blocker > Labels: easyfix, patch > Original Estimate: 2h > Remaining Estimate: 2h > > An attempt to run a Spark cluster on a Mac OS machine fails > Invocation: > {code} > # cd $SPARK_HOME > Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh > {code} > Output: > {code} > starting org.apache.spark.deploy.master.Master, logging to > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > failed to launch org.apache.spark.deploy.master.Master: > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 
7 more > full log in > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out > {code} > Log: > {code} > # Options read when launching programs locally with > # ./bin/run-example or ./bin/spark-submit > Spark Command: > /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp > /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar > -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port > 7077 --webui-port 8080 > > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) > at java.lang.Class.privateGetMethodRecursive(Class.java:3048) > at java.lang.Class.getMethod0(Class.java:3018) > at java.lang.Class.getMethod(Class.java:1784) > at > sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) > at > sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) > Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > {code} > Proposed short term fix: > Bundle all required 3rd party libs into the uberjar and/or fix the start-up script > to include the required 3rd party libs. > Long term quality improvement proposal: Introduce integration tests to check the > distribution before releasing.
[jira] [Created] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
Pranas Baliuka created SPARK-10944: -- Summary: org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar Key: SPARK-10944 URL: https://issues.apache.org/jira/browse/SPARK-10944 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.5.1 Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop Reporter: Pranas Baliuka Priority: Blocker An attempt to run a Spark cluster on a Mac OS machine fails Invocation: {code} # cd $SPARK_HOME Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh {code} Output: {code} starting org.apache.spark.deploy.master.Master, logging to /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out failed to launch org.apache.spark.deploy.master.Master: at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 7 more full log in /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out {code} Log: {code} # Options read when launching programs locally with # ./bin/run-example or ./bin/spark-submit Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 7077 --webui-port 8080 Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.privateGetMethodRecursive(Class.java:3048) at java.lang.Class.getMethod0(Class.java:3018) at java.lang.Class.getMethod(Class.java:1784) at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) at 
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) {code} Proposed short term fix: Bundle all required 3rd party libs into the uberjar and/or fix the start-up script to include the required 3rd party libs. Long term quality improvement proposal: Introduce integration tests to check the distribution before releasing.
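For what it's worth, the documented way to run a "without hadoop" (Hadoop-free) build is to supply Hadoop's jars, which bring in slf4j, via SPARK_DIST_CLASSPATH rather than bundling them into the assembly. A sketch for conf/spark-env.sh, assuming a local Hadoop installation with `hadoop` on the PATH:

```shell
# conf/spark-env.sh -- the "without hadoop" assembly deliberately omits
# Hadoop and its transitive dependencies (slf4j among them), so point
# Spark at a locally installed Hadoop's classpath instead:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```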
[jira] [Created] (SPARK-10943) NullType Column cannot be written to Parquet
Jason Pohl created SPARK-10943: -- Summary: NullType Column cannot be written to Parquet Key: SPARK-10943 URL: https://issues.apache.org/jira/browse/SPARK-10943 Project: Spark Issue Type: Bug Reporter: Jason Pohl var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments") //FAIL - Try writing a NullType column (where all the values are NULL) data02.write.parquet("/tmp/celtra-test/dataset2") at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in stage 179.0 (TID 39924, 10.0.196.208): org.apache.spark.sql.AnalysisException: Unsupported data type StructField(comments,NullType,true).dataType; at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.sql.types.StructType.map(StructType.scala:92) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) at org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) at org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94) at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRel
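Since Parquet has no encoding for Spark SQL's NullType, a common workaround (my sketch, not part of the report) is to cast the all-null column to a concrete type before writing:

```scala
import org.apache.spark.sql.functions.lit

// Give the all-null column a concrete type (string) so the Parquet
// schema converter can map it; the values remain NULL.
val data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments")
val fixed = data02.withColumn("comments", lit(null).cast("string"))
fixed.write.parquet("/tmp/celtra-test/dataset2")
```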
[jira] [Updated] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pritchard updated SPARK-10942: --- Priority: Minor (was: Major) > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard >Priority: Minor > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to work around it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. > {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code}
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944508#comment-14944508 ] Nick Pritchard commented on SPARK-10942: [~rekhajoshm] Thanks for trying to reproduce it. Since you do not see the same, this is most likely an issue on my end, so I'll downgrade the priority. I am using 1.5.0, so I will try 1.6.0-SNAPSHOT and also investigate the logs. > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to work around it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. 
> {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code}
[jira] [Updated] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-10942: Attachment: SPARK-10942_3.png SPARK-10942_2.png SPARK-10942_1.png SPARK-10942: TestStreaming job run to check the cache and storage scenario. So far, for my runs, the storage gets cleared out. > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to work around it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. > {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code}
[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944500#comment-14944500 ] Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:50 AM: -- SPARK-10942: Attached job run screenshots for the TestStreaming job run to check the cache and storage scenario. So far, for my runs, the storage gets cleared out. was (Author: rekhajoshm): SPARK-10942: TestStreaming job run to check the cache and storage scenario. So far, for my runs, the storage gets cleared out. > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to work around it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. 
> {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code}
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944497#comment-14944497 ] Rekha Joshi commented on SPARK-10942: - Thanks [~pnpritchard]. I have tried to replicate the issue a few times now. So far I see the Storage tab getting cleaned out; I do not even specify a ttl. Attached job run screenshots. I am on 1.6.0-snapshot and do not currently have any other load on the system, but first-level diagnosis suggests the automatic unpersist does happen. I also see the logs below stating that the persistence list is updated in the background and storage is cleared. [~sowen] [~vanzin] Your thoughts? Thanks {panel} 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer() 15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: {panel}
[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944497#comment-14944497 ] Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:48 AM: -- Thanks [~pnpritchard]. I have tried to replicate the issue a few times now. So far I see the Storage tab getting cleaned out; I do not even specify a ttl. Attached job run screenshots. I am on 1.6.0-snapshot and do not currently have any other load on the system, but first-level diagnosis suggests the automatic unpersist does happen. I also see the logs below stating that the persistence list is updated in the background and storage is cleared. [~sowen] [~vanzin] Your thoughts? Thanks {panel} 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer() 15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: {panel} was (Author: rekhajoshm): Thanks [~pnpritchard] I tried to replicate the issue few times now. So far I see Storage tab getting cleaned out.I do not even specify ttl. Attached job run screenshots.I am on 1.6.0-snapshot, and do not currently have any other load on the system but first level diagnosis seems automatic unpersist does happen.I do see below logs also stating persistence list is getting updated in background, and storage cleared. [~sowen] [~vanzin] Your thoughts? Thanks {panel} 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from persistence list 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30 15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer() 15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: {panel}
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944485#comment-14944485 ] Seth Hendrickson commented on SPARK-10641: -- My apologies, I haven't been able to devote much time to this lately. To your point, one of the bigger decisions for this PR will be how to combine these functions with other aggregates, since online algorithms for higher-order statistical moments require the calculation of all the lower-order moments. I can have a WIP PR up by tomorrow, so we can get some discussion going. This PR will also be affected by several other ongoing PRs. > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > > Implementing skewness and kurtosis support based on the following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
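The comment's point, that online higher-order moments require all the lower-order moments, shows up directly in the one-pass update rule from the Wikipedia algorithm the ticket links: updating M4 needs M3 and M2, which in turn need the running mean and count. A minimal, Spark-free Scala sketch of that update (illustrative names, not the eventual Spark API):

```scala
// One-pass (online) central moments, following the update rule from
// Wikipedia's "Algorithms for calculating variance # Higher-order statistics".
// Note that the m4 update uses m3 and m2, and m3 uses m2: this coupling is
// why combining these functions with other aggregates needs care.
case class Moments(n: Long, mean: Double, m2: Double, m3: Double, m4: Double) {
  def add(x: Double): Moments = {
    val n1    = n + 1
    val delta = x - mean
    val dn    = delta / n1
    val dn2   = dn * dn
    val t1    = delta * dn * n
    Moments(
      n1,
      mean + dn,
      m2 + t1,
      m3 + t1 * dn * (n1 - 2) - 3 * dn * m2,
      m4 + t1 * dn2 * (n1 * n1 - 3 * n1 + 3) + 6 * dn2 * m2 - 4 * dn * m3
    )
  }
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0 // excess kurtosis
}

val stats = Seq(1.0, 2.0, 3.0, 4.0, 5.0).foldLeft(Moments(0L, 0.0, 0.0, 0.0, 0.0))(_ add _)
// symmetric input: skewness 0, excess kurtosis -1.3
```

A distributed implementation would also need a pairwise `merge(other: Moments)` for combining partial results across partitions; the pairwise formulas are on the same Wikipedia page.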
[jira] [Commented] (SPARK-10382) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944480#comment-14944480 ] Xusen Yin commented on SPARK-10382: --- [~mengxr] I'd love to work on this if no one else is working on it. > Make example code in user guide testable > > > Key: SPARK-10382 > URL: https://issues.apache.org/jira/browse/SPARK-10382 > Project: Spark > Issue Type: Brainstorming > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to test it automatically. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move the actual example code to spark/examples and > test compilation in Jenkins builds. Then, in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "guide" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > This is just one way to implement it. It would be nice to hear more ideas.
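The core of the proposed include_example tag is pulling a marked region out of an example source file. The real tag would be a Ruby/Jekyll plugin, but the extraction logic can be sketched in Scala; the marker-comment syntax below (`// $example on:guide$`) is hypothetical, since the JIRA explicitly leaves that syntax open for discussion:

```scala
// Sketch of the extraction step behind the proposed include_example tag.
// Returns the lines between hypothetical begin/end marker comments.
def extractMarked(source: String, label: String): String = {
  val begin = s"// $$example on:$label$$"
  val end   = s"// $$example off:$label$$"
  val lines = source.split("\n").toVector
  val start = lines.indexWhere(_.trim == begin)
  val stop  = lines.indexWhere(_.trim == end, start + 1)
  require(start >= 0 && stop > start, s"markers for '$label' not found")
  lines.slice(start + 1, stop).mkString("\n")
}

val exampleFile =
  """object KMeansExample {
    |  // $example on:guide$
    |  val k = 3
    |  // $example off:guide$
    |}""".stripMargin

// Jekyll would then wrap the extracted block in {% highlight %}
val guideBlock = extractMarked(exampleFile, "guide")
```

Because the full example file lives under spark/examples, Jenkins compiles it, so a guide snippet can no longer silently rot.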
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944478#comment-14944478 ] Nick Pritchard commented on SPARK-10942: Regardless, the documentation for {{spark.streaming.unpersist}} and {{spark.cleaner.ttl}} suggests that unpersisting will be handled automatically by Spark code.
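For reference, the two settings the comment refers to, shown with their documented defaults in the Spark 1.x configuration guide (worth re-checking against the guide for the exact version in use):

```properties
# Automatically unpersist RDDs generated and persisted by Spark Streaming
spark.streaming.unpersist=true
# Forced periodic cleanup of old metadata and RDDs (seconds); unset by default
# spark.cleaner.ttl=3600
```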
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944477#comment-14944477 ] Nick Pritchard commented on SPARK-10942: [~rekhajoshm] Yes, but calling {{rdd2.unpersist()}} negates the call to {{rdd2.cache()}}, no matter where I put it in the {{transform}} closure. This is because all the operations on {{rdd2}} are lazy.
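The laziness argument can be made concrete without Spark: `cache()` only records intent, data is actually stored when an action materializes the result, so an `unpersist()` issued before any action simply cancels the request. A toy, Spark-free model of that sequencing (not Spark's implementation):

```scala
// Toy model of lazy caching: cache() marks intent, collect() (an "action")
// materializes and stores, unpersist() clears both intent and stored data.
class ToyRDD[A](compute: () => Seq[A]) {
  private var cacheRequested = false
  private var cached: Option[Seq[A]] = None
  def cache(): this.type = { cacheRequested = true; this }
  def unpersist(): this.type = { cacheRequested = false; cached = None; this }
  def collect(): Seq[A] = cached.getOrElse {
    val data = compute()
    if (cacheRequested) cached = Some(data)
    data
  }
  def isCached: Boolean = cached.nonEmpty
}

val rdd2 = new ToyRDD(() => Seq(1, 2, 3)).cache()
rdd2.unpersist() // issued before any action...
rdd2.collect()   // ...so nothing is ever stored, mirroring the behavior above
```

Inside a {{transform}} closure there is no safe point to call unpersist yourself: before the action it cancels the cache, after the batch the RDD reference is gone, which is why the reporter relies on Spark's automatic cleanup.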
[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944465#comment-14944465 ] Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:04 AM: -- [~pnpritchard] Hi, did you try {{rdd2.unpersist()}}? was (Author: rekhajoshm): [~pnpritchard] hi.did you try rdd.unpersist()?
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944465#comment-14944465 ] Rekha Joshi commented on SPARK-10942: - [~pnpritchard] Hi, did you try {{rdd.unpersist()}}?
[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement
[ https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944462#comment-14944462 ] Dilip Biswal commented on SPARK-10534: -- I would like to work on this. > ORDER BY clause allows only columns that are present in SELECT statement > > > Key: SPARK-10534 > URL: https://issues.apache.org/jira/browse/SPARK-10534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michal Cwienczek > > When invoking query SELECT EmployeeID from Employees order by YEAR(HireDate) > Spark 1.5 throws exception: > {code} > cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given > input columns EmployeeID; line 2 pos 14 StackTrace: > org.apache.spark.sql.AnalysisException: cannot resolve > 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns > EmployeeID; line 2 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$c
[jira] [Created] (SPARK-10942) Not all cached RDDs are unpersisted
Nick Pritchard created SPARK-10942: -- Summary: Not all cached RDDs are unpersisted Key: SPARK-10942 URL: https://issues.apache.org/jira/browse/SPARK-10942 Project: Spark Issue Type: Bug Components: Streaming Reporter: Nick Pritchard I have a Spark Streaming application that caches RDDs inside of a {{transform}} closure. Looking at the Spark UI, it seems that most of these RDDs are unpersisted after the batch completes, but not all. I have copied a minimal reproducible example below to highlight the problem. I run this and monitor the Spark UI "Storage" tab. The example generates and caches 30 RDDs, and I see most get cleaned up. However, in the end, some still remain cached. There is some randomness going on because I see different RDDs remain cached for each run. I have marked this as Major because I haven't been able to work around it and it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} but that did not change anything. {code}
val inputRDDs = mutable.Queue.tabulate(30) { i =>
  sc.parallelize(Seq(i))
}
val input: DStream[Int] = ssc.queueStream(inputRDDs)
val output = input.transform { rdd =>
  if (rdd.isEmpty()) {
    rdd
  } else {
    val rdd2 = rdd.map(identity)
    rdd2.setName(rdd.first().toString)
    rdd2.cache()
    val rdd3 = rdd2.map(identity)
    rdd3
  }
}
output.print()
ssc.start()
ssc.awaitTermination()
{code}
[jira] [Commented] (SPARK-6723) Model import/export for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944415#comment-14944415 ] Jayant Shekhar commented on SPARK-6723: --- [~fliang] Hey Feyman, I made changes to the PR based on your inputs and fixed the merge conflicts. You can try it out. Thanks. > Model import/export for ChiSqSelector > - > > Key: SPARK-6723 > URL: https://issues.apache.org/jira/browse/SPARK-6723 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor >
[jira] [Comment Edited] (SPARK-6723) Model import/export for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944415#comment-14944415 ] Jayant Shekhar edited comment on SPARK-6723 at 10/6/15 2:25 AM: [~fliang] Hey Feynman, I made changes to the PR based on your inputs and fixed the merge conflicts. You can try it out. Thanks. was (Author: jayants): [~fliang] Hey Feyman, I made changes to the PR based on your inputs and fixed the merge conflicts. You can try it out. Thanks.
[jira] [Resolved] (SPARK-10900) Add output operation events to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10900. --- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.6.0 > Add output operation events to StreamingListener > > > Key: SPARK-10900 > URL: https://issues.apache.org/jira/browse/SPARK-10900 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > >
[jira] [Commented] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity
[ https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944411#comment-14944411 ] Apache Spark commented on SPARK-10941: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/8973 > Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve > code clarity > -- > > Key: SPARK-10941 > URL: https://issues.apache.org/jira/browse/SPARK-10941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark SQL's new AlgebraicAggregate interface is confusingly named. > AlgebraicAggregate inherits from AggregateFunction2, adds a new set of > methods, then effectively bans the use of the inherited methods. This is > really confusing. I think that it's an anti-pattern / bad code smell if you > end up inheriting and wanting to remove methods inherited from the superclass. > I think that we should rename this class and refactor the class > hierarchy so that there's a clear distinction between which parts of the code > work with imperative aggregate functions vs. expression-based aggregates.
[jira] [Assigned] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity
[ https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10941: Assignee: Apache Spark (was: Josh Rosen)
[jira] [Assigned] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity
[ https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10941: Assignee: Josh Rosen (was: Apache Spark)
[jira] [Created] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity
Josh Rosen created SPARK-10941: -- Summary: Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity Key: SPARK-10941 URL: https://issues.apache.org/jira/browse/SPARK-10941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Spark SQL's new AlgebraicAggregate interface is confusingly named. AlgebraicAggregate inherits from AggregateFunction2, adds a new set of methods, then effectively bans the use of the inherited methods. This is really confusing. I think that it's an anti-pattern / bad code smell if you end up inheriting and wanting to remove methods inherited from the superclass. I think that we should rename this class and refactor the class hierarchy so that there's a clear distinction between which parts of the code work with imperative aggregate functions vs. expression-based aggregates.
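The refactoring direction the ticket describes, sibling contracts under a common parent instead of a subclass that bans inherited methods, can be sketched as below. The trait names and members are illustrative only, not necessarily what Spark ended up with:

```scala
// Anti-pattern in the ticket: a subclass inherits update/merge methods and
// then forbids calling them. A cleaner split keeps the two styles as
// siblings, so neither carries methods it must disown.
sealed trait AggregateFunction { def name: String }

// Imperative aggregates mutate a buffer row by row.
trait ImperativeAggregate extends AggregateFunction {
  type Buffer
  def initialize(): Buffer
  def update(buf: Buffer, input: Double): Buffer
  def merge(a: Buffer, b: Buffer): Buffer
  def evaluate(buf: Buffer): Double
}

// Expression-based aggregates describe their updates declaratively instead;
// the String placeholder stands in for real expression trees.
trait DeclarativeAggregate extends AggregateFunction {
  def updateExpressions: Seq[String]
}

// Example instance of the imperative contract.
object Count extends ImperativeAggregate {
  type Buffer = Long
  val name = "count"
  def initialize(): Long = 0L
  def update(buf: Long, input: Double): Long = buf + 1
  def merge(a: Long, b: Long): Long = a + b
  def evaluate(buf: Long): Double = buf.toDouble
}
```

Code paths can then require exactly the contract they support (an `ImperativeAggregate` or a `DeclarativeAggregate`), which is the "clear distinction" the ticket asks for.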
[jira] [Commented] (SPARK-9239) HiveUDAF support for AggregateFunction2
[ https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944405#comment-14944405 ] Josh Rosen commented on SPARK-9239: --- Is this duplicated by SPARK-10765? > HiveUDAF support for AggregateFunction2 > --- > > Key: SPARK-9239 > URL: https://issues.apache.org/jira/browse/SPARK-9239 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.
[jira] [Comment Edited] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one
[ https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944305#comment-14944305 ] Bryan Cutler edited comment on SPARK-10560 at 10/6/15 1:17 AM: --- Hi [~yanboliang], I just want to make sure I'm on the same page as to what we need to do here. Here are the differences I see between the Python and Scala APIs for StreamingLogisticRegressionWithSGD: - * The documentation for Python is missing the default parameter values, also the same for StreamingLinearRegressionWithSGD * In Python StreamingLogisticRegressionWithSGD the regularization defaults to 0.01 while the Scala version defaults to 0. I believe other SGD implementations default to non-zero, so maybe there is some reason to turn it off in Streaming implementations? In any case, these ones should probably default to the same value * The Scala StreamingLogisticRegressionWithSGD is missing a method to set convergence tolerance, it is in the Python one * StreamingLogisticRegressionWithSGD for Scala and Python are missing ability to set regularization parameter * Python Streaming**RegressionWithSGD are missing API methods to set parameters, i.e. setStepSize - How about for this JIRA, I fix the documentation to include default parameters and then I will make JIRAs for the other items? was (Author: bryanc): Hi [~yanboliang]], I just want to make sure I'm on the same page as to what we need to do here. Here are the differences I see between the Python and Scala APIs for StreamingLogisticRegressionWithSGD: - * The documentation for Python is missing the default parameter values, also the same for StreamingLinearRegressionWithSGD * In Python StreamingLogisticRegressionWithSGD the regularization defaults to 0.01 while the Scala version defaults to 0. I believe other SGD implementations default to non-zero, so maybe there is some reason to turn it off in Streaming implementations? 
In any case, these ones should probably default to the same value * The Scala StreamingLogisticRegressionWithSGD is missing a method to set convergence tolerance, it is in the Python one * StreamingLogisticRegressionWithSGD for Scala and Python are missing ability to set regularization parameter * Python Streaming**RegressionWithSGD are missing API methods to set parameters, i.e. setStepSize How about for this JIRA, I fix the documentation to include default parameters and then I will make JIRAs for the other items? > Make StreamingLogisticRegressionWithSGD Python API equals with Scala one > > > Key: SPARK-10560 > URL: https://issues.apache.org/jira/browse/SPARK-10560 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > StreamingLogisticRegressionWithSGD Python API lacks of some parameters > compared with Scala one, here we make them equality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944340#comment-14944340 ] Michael Armbrust commented on SPARK-9776: - You should not create a HiveContext in the spark-shell. One is already created for you as sqlContext. > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10585) only copy data once when generate unsafe projection
[ https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944337#comment-14944337 ] Apache Spark commented on SPARK-10585: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8991 > only copy data once when generate unsafe projection > --- > > Key: SPARK-10585 > URL: https://issues.apache.org/jira/browse/SPARK-10585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > > When we have nested struct, array or map, we will create a byte buffer for > each of them, and copy data to the buffer first, then copy them to the final > row buffer. We can save the first copy and directly copy data to final row > buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
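The copy the pull request eliminates can be illustrated generically: instead of serializing each nested value into its own scratch buffer and then copying that buffer into the row, write each value directly into the final row buffer. This is a pure-Python sketch of the idea only — Spark's actual implementation is generated Java code writing into UnsafeRow:

```python
# Sketch of the optimization in SPARK-10585 (illustrative, not Spark code).

def project_two_copies(fields):
    """Old scheme: each nested value gets its own scratch buffer first."""
    out = bytearray()
    for f in fields:
        scratch = bytearray(f)   # copy 1: value -> per-field byte buffer
        out.extend(scratch)      # copy 2: per-field buffer -> final row buffer
    return bytes(out)

def project_one_copy(fields):
    """New scheme: append each value straight into the final row buffer."""
    out = bytearray()
    for f in fields:
        out.extend(f)            # single copy: value -> final row buffer
    return bytes(out)
```

Both produce identical row bytes; the saving is the eliminated intermediate buffer and copy per nested struct, array, or map.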
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944335#comment-14944335 ] Alexander Pivovarov commented on SPARK-9776: To reproduce the issue 1. start emr-4.1.0 cluster (it comes with spark-1.5.0 and yarn) 2. ssh to master box 3. open spark-shell 4. run new org.apache.spark.sql.hive.HiveContext(sc) > Another instance of Derby may have already booted the database > --- > > Key: SPARK-9776 > URL: https://issues.apache.org/jira/browse/SPARK-9776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: Mac Yosemite, spark-1.5.0 >Reporter: Sudhakar Thota > Attachments: SPARK-9776-FL1.rtf > > > val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in > error. Though the same works for spark-1.4.1. > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
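The error itself comes from Derby's embedded mode, which lets only one JVM boot a given database directory at a time; a second HiveContext tries to boot the same metastore_db. The exclusivity can be mimicked with a plain advisory file lock — illustrative only, Derby uses its own db.lck mechanism:

```python
# Mimic Derby's single-boot rule with an exclusive file lock (Linux).
import fcntl
import os
import tempfile

def boot(db_lock_path):
    """Try to take the exclusive 'database' lock, as Derby does on boot."""
    handle = open(db_lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle  # boot succeeded; keep the handle open to hold the lock
    except BlockingIOError:
        handle.close()
        raise RuntimeError("Another instance may have already booted the database")

lock_path = os.path.join(tempfile.mkdtemp(), "db.lck")
first = boot(lock_path)  # first HiveContext: boots fine and holds the lock
```

A second boot() against the same path fails the same way a second HiveContext fails against the same metastore directory — which is why reusing the sqlContext the shell already created avoids the error.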
[jira] [Updated] (SPARK-10934) hashCode of unsafe array may crash
[ https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10934: --- Assignee: Wenchen Fan > hashCode of unsafe array may crush > -- > > Key: SPARK-10934 > URL: https://issues.apache.org/jira/browse/SPARK-10934 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.2, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10934) hashCode of unsafe array may crash
[ https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10934: --- Fix Version/s: 1.5.2 > hashCode of unsafe array may crush > -- > > Key: SPARK-10934 > URL: https://issues.apache.org/jira/browse/SPARK-10934 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.5.2, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10934) hashCode of unsafe array may crash
[ https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10934. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8987 [https://github.com/apache/spark/pull/8987] > hashCode of unsafe array may crush > -- > > Key: SPARK-10934 > URL: https://issues.apache.org/jira/browse/SPARK-10934 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10940) Too many open files Spark Shuffle
[ https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944307#comment-14944307 ] Sandeep Pal commented on SPARK-10940:
-
I have already read the similar issue below, but it does not help: https://issues.apache.org/jira/browse/SPARK-9921
> Too many open files Spark Shuffle > - > > Key: SPARK-10940 > URL: https://issues.apache.org/jira/browse/SPARK-10940 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node standalone spark cluster with 1 master and 5 > worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and > 36 cores. >Reporter: Sandeep Pal > > Executing terasort by Spark-SQL on the data generated by teragen in hadoop. > Data size generated is ~456 GB. > Terasort passing with --total-executor-cores = 40, where as failing for > --total-executor-cores = 120. > I have tried to increase the ulimit to 10k but the problem persists. > Below is the error message from one of the executor node: > java.io.FileNotFoundException: > /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 > (Too many open files) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
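One thing worth verifying is whether the raised ulimit actually reached the executor JVMs — a limit set in a login shell does not propagate to worker daemons that were started earlier. The effective per-process limit can be inspected and raised (up to the hard limit) from Python; this is a sketch for checking a single process, not a cluster-wide fix:

```python
# Inspect and raise the file-descriptor limit for the current process.
import resource

# Query the current soft/hard limits for open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit toward 10k, capped by the hard limit
# (only root can raise the hard limit itself).
target = min(10240, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

For Spark executors, the limit must be in effect in the worker daemon's environment (e.g. /etc/security/limits.conf plus a re-login and worker restart), not just in the shell that runs spark-submit.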
[jira] [Commented] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to the Scala one
[ https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944305#comment-14944305 ] Bryan Cutler commented on SPARK-10560:
--
Hi [~yanboliang], I just want to make sure I'm on the same page as to what we need to do here. Here are the differences I see between the Python and Scala APIs for StreamingLogisticRegressionWithSGD:
* The documentation for Python is missing the default parameter values; the same is true for StreamingLinearRegressionWithSGD
* In Python, StreamingLogisticRegressionWithSGD's regularization defaults to 0.01 while the Scala version defaults to 0. I believe other SGD implementations default to non-zero, so maybe there is some reason to turn it off in the Streaming implementations? In any case, these should probably default to the same value
* The Scala StreamingLogisticRegressionWithSGD is missing a method to set convergence tolerance; the Python one has it
* StreamingLogisticRegressionWithSGD for Scala and Python are both missing the ability to set the regularization parameter
* The Python Streaming*RegressionWithSGD classes are missing API methods to set parameters, e.g. setStepSize

How about for this JIRA, I fix the documentation to include default parameters, and then I will make JIRAs for the other items?
> Make StreamingLogisticRegressionWithSGD Python API equals with Scala one > > > Key: SPARK-10560 > URL: https://issues.apache.org/jira/browse/SPARK-10560 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > StreamingLogisticRegressionWithSGD Python API lacks of some parameters > compared with Scala one, here we make them equality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
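A mechanical diff of the two APIs' defaults keeps a list like the one above honest. The regParam values below are the ones stated in the comment (Python 0.01 vs Scala 0.0); the convergenceTol entry is illustrative, standing in for the Python-only parameter the comment mentions:

```python
# Parameter -> default maps; values taken from / illustrating the comment,
# not read from Spark itself.
python_defaults = {"regParam": 0.01, "convergenceTol": 0.001}
scala_defaults = {"regParam": 0.0}

def diff_defaults(a, b):
    """Return (only_in_a, only_in_b, differing) for two param->default maps."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, differing
```

Running diff_defaults(python_defaults, scala_defaults) surfaces exactly the two discrepancies the comment calls out: a Python-only parameter and a mismatched regularization default.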
[jira] [Issue Comment Deleted] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason C Lee updated SPARK-10925: Comment: was deleted (was: I removed your 2nd step "apply an UDF on column "name"" and was able to also recreate the problem. I reduced your test case to the following: import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext import org.apache.spark.sql.functions._ object TestCase2 { case class Individual(id: String, name: String, surname: String, birthDate: String) def main(args: Array[String]) { val sc = new SparkContext("local", "join DFs") val sqlContext = new SQLContext(sc) val rdd = sc.parallelize(Seq( Individual("14", "patrick", "andrews", "10/10/1970") )) val df = sqlContext.createDataFrame(rdd) df.show() val df1 = df; val df2 = df1.withColumn("surname1", df("surname")) df2.show() val df3 = df2.withColumn("birthDate1", df("birthDate")) df3.show() val cardinalityDF1 = df3.groupBy("name") .agg(count("name").as("cardinality_name")) cardinalityDF1.show() val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name")) df4.show() val cardinalityDF2 = df4.groupBy("surname1") .agg(count("surname1").as("cardinality_surname")) cardinalityDF2.show() val df5 = df4.join(cardinalityDF2, df4("surname") === cardinalityDF2("surname1")) df5.show() } }) > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.L
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944303#comment-14944303 ] Jason C Lee commented on SPARK-10925:
-
I removed your 2nd step "apply an UDF on column "name"" and was able to also recreate the problem. I reduced your test case to the following:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object TestCase2 {
  case class Individual(id: String, name: String, surname: String, birthDate: String)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "join DFs")
    val sqlContext = new SQLContext(sc)

    val rdd = sc.parallelize(Seq(Individual("14", "patrick", "andrews", "10/10/1970")))
    val df = sqlContext.createDataFrame(rdd)
    df.show()

    val df1 = df
    val df2 = df1.withColumn("surname1", df("surname"))
    df2.show()
    val df3 = df2.withColumn("birthDate1", df("birthDate"))
    df3.show()

    val cardinalityDF1 = df3.groupBy("name").agg(count("name").as("cardinality_name"))
    cardinalityDF1.show()
    val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name"))
    df4.show()

    val cardinalityDF2 = df4.groupBy("surname1").agg(count("surname1").as("cardinality_surname"))
    cardinalityDF2.show()
    val df5 = df4.join(cardinalityDF2, df4("surname") === cardinalityDF2("surname1"))
    df5.show()
  }
}
{code}
> Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at sca
[jira] [Updated] (SPARK-10940) Too many open files Spark Shuffle
[ https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-10940:
Description:
Executing terasort by Spark-SQL on the data generated by teragen in hadoop. Data size generated is ~456 GB. Terasort passing with --total-executor-cores = 40, where as failing for --total-executor-cores = 120. I have tried to increase the ulimit to 10k but the problem persists.
Below is the error message from one of the executor node:
java.io.FileNotFoundException: /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 (Too many open files)

was:
Executing terasort by Spark-SQL on the data generated by teragen in hadoop. Data size generated is ~456 GB. Terasort passing with --total-executor-cores = 40, where failing for --total-executor-cores = 120. I have tried to increase the ulimit to 10k but the problem persists.
Below is the error message from one of the executor node:
java.io.FileNotFoundException: /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 (Too many open files)

> Too many open files Spark Shuffle > - > > Key: SPARK-10940 > URL: https://issues.apache.org/jira/browse/SPARK-10940 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node standalone spark cluster with 1 master and 5 > worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and > 36 cores. >Reporter: Sandeep Pal > > Executing terasort by Spark-SQL on the data generated by teragen in hadoop. > Data size generated is ~456 GB. > Terasort passing with --total-executor-cores = 40, where as failing for > --total-executor-cores = 120. 
> I have tried to increase the ulimit to 10k but the problem persists. > Below is the error message from one of the executor node: > java.io.FileNotFoundException: > /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 > (Too many open files) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10940) Too many open files Spark Shuffle
Sandeep Pal created SPARK-10940:
---

 Summary: Too many open files Spark Shuffle
 Key: SPARK-10940
 URL: https://issues.apache.org/jira/browse/SPARK-10940
 Project: Spark
 Issue Type: Bug
 Components: Shuffle, SQL
 Affects Versions: 1.5.0
 Environment: 6 node standalone spark cluster with 1 master and 5 worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 36 cores.
 Reporter: Sandeep Pal

Executing terasort by Spark-SQL on the data generated by teragen in hadoop. Data size generated is ~456 GB. Terasort passes with --total-executor-cores = 40, whereas it fails for --total-executor-cores = 120. I have tried to increase the ulimit to 10k but the problem persists.

Below is the error message from one of the executor nodes:
java.io.FileNotFoundException: /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 (Too many open files)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition
[ https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944282#comment-14944282 ] Dan Brown commented on SPARK-10685: --- [~davies] [~joshrosen] Ok, I've split out the zip-after-repartition issue as https://issues.apache.org/jira/browse/SPARK-10939. > Misaligned data with RDD.zip and DataFrame.withColumn after repartition > --- > > Key: SPARK-10685 > URL: https://issues.apache.org/jira/browse/SPARK-10685 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.4.1, 1.5.0 > Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5 > - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5 >Reporter: Dan Brown >Assignee: Reynold Xin >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a > {{repartition}} produces "misaligned" data, meaning different column values > in the same row aren't matched, as if a zip shuffled the collections before > zipping them. It's difficult to reproduce because it's nondeterministic, > doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was > able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), > and 1.5.0 (bin-without-hadoop). > Here's the most similar issue I was able to find. It appears to not have been > repro'd and then closed optimistically, and it smells like it could have been > the same underlying cause that was never fixed: > - https://issues.apache.org/jira/browse/SPARK-9131 > Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying > to build it ourselves when we ran into this problem. Let me put in my vote > for reopening the issue and supporting {{DataFrame.zip}} in the standard lib. > - https://issues.apache.org/jira/browse/SPARK-7460 > h3. 
Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?   cluster mode   spark version
>
> yes                  local[8]       1.3.0-cdh5.4.5
> no                   dist[4]        1.3.0-cdh5.4.5
> yes                  local[8]       1.4.1
> yes                  dist[1]        1.4.1
> no                   dist[2]        1.4.1
> no                   dist[4]        1.4.1
> yes                  local[8]       1.5.0
> yes                  dist[1]        1.5.0
> no                   dist[2]        1.5.0
> no                   dist[4]        1.5.0
> {code}
> h3. 
Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), numSlices=100))
> df = df.repar
[jira] [Created] (SPARK-10939) Misaligned data with RDD.zip after repartition
Dan Brown created SPARK-10939:
-

 Summary: Misaligned data with RDD.zip after repartition
 Key: SPARK-10939
 URL: https://issues.apache.org/jira/browse/SPARK-10939
 Project: Spark
 Issue Type: Bug
 Affects Versions: 1.5.0, 1.4.1, 1.3.0
 Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
 - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
 Reporter: Dan Brown

Split out from https://issues.apache.org/jira/browse/SPARK-10685:

Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces "misaligned" data, meaning different column values in the same row aren't matched, as if a zip shuffled the collections before zipping them. It's difficult to reproduce because it's nondeterministic, doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 (bin-without-hadoop).

Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying to build it ourselves when we ran into this problem. Let me put in my vote for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
- https://issues.apache.org/jira/browse/SPARK-7460

h3. Repro

Fail: RDD.zip after repartition
{code}
df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
df = df.repartition(100)
rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, b=y.b))
[r for r in rdd.collect() if r.a != r.b][:3] # Should be []
{code}
Sample outputs (nondeterministic):
{code}
[]
[Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
[]
[]
[Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
[]
{code}
Test setup:
- local\[8]: {{MASTER=local\[8]}}
- dist\[N]: 1 driver + 1 master + N workers
{code}
"Fail" tests pass?   cluster mode   spark version
yes                  local[8]       1.3.0-cdh5.4.5
no                   dist[4]        1.3.0-cdh5.4.5
yes                  local[8]       1.4.1
yes                  dist[1]        1.4.1
no                   dist[2]        1.4.1
no                   dist[4]        1.4.1
yes                  local[8]       1.5.0
yes                  dist[1]        1.5.0
no                   dist[2]        1.5.0
no                   dist[4]        1.5.0
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
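The failure mode can be seen without Spark: zip is only well-defined when both sides enumerate their elements in identical order, and a shuffle whose fetch order is nondeterministic can give each independently evaluated lineage its own order. A pure-Python simulation — the "partitioner" below is a hypothetical stand-in, not Spark's actual shuffle:

```python
# Simulate zipping two independently evaluated shuffles of the same data.
import random

def repartition(records, num_partitions, seed):
    """Toy shuffle: deal records into partitions in a seed-dependent order.

    The seed stands in for the nondeterministic fetch order a real
    distributed shuffle can produce on each evaluation of a lineage.
    """
    rng = random.Random(seed)
    order = list(records)
    rng.shuffle(order)
    parts = [[] for _ in range(num_partitions)]
    for i, r in enumerate(order):
        parts[i % num_partitions].append(r)
    return [r for p in parts for r in p]  # concatenate partitions back

data = list(range(1000))
# Two lineages of the "same" dataset, each shuffled with its own order:
a = repartition(data, 100, seed=1)
b = repartition(data, 100, seed=2)
misaligned = [(x, y) for x, y in zip(a, b) if x != y]
```

With identical orders the zip lines up perfectly; with independent orders it does not — consistent with the report that the bug only appears in distributed mode, where fetch order actually varies, and never in local mode.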
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944271#comment-14944271 ] Marcelo Vanzin commented on SPARK-10937: And I assume that you checked the classpath of the running spark shell and made sure there are no other Hive jars polluting it? Can you post the output of {{sys.props("java.class.path")}} from that shell? (The shell should work even if you get that error.) > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > The method 'public String getDefaultExpr()' does not exist in the inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. 
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpa
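A defensive pattern that would avoid this hard failure is to probe for the newer accessor before calling it and fall back when it is absent. The sketch below mirrors that idea in Python with {{hasattr}}; the class and method names are illustrative stand-ins for Hive's ConfVars, not real APIs (the actual fix would live in Scala/Java, e.g. via a reflection check in HiveContext):

```python
class OldConfVar:
    """Stands in for a pre-0.14 Hive ConfVars entry: no getDefaultExpr()."""
    def __init__(self, default):
        self.default_val = default

class NewConfVar(OldConfVar):
    """Stands in for a Hive >= 0.14 entry, which adds the accessor."""
    def get_default_expr(self):
        return self.default_val

def default_expr(conf_var):
    # Probe for the newer accessor instead of assuming it exists; this is
    # the moral equivalent of guarding with reflection the Java call that
    # currently throws NoSuchMethodError against Hive 0.12/0.13.
    if hasattr(conf_var, "get_default_expr"):
        return conf_var.get_default_expr()
    return conf_var.default_val  # older Hive: fall back to the raw default

assert default_expr(NewConfVar("warehouse")) == "warehouse"
assert default_expr(OldConfVar("warehouse")) == "warehouse"
```

The same feature-detection idea is what lets a single code path serve metastore versions that predate the method.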
[jira] [Issue Comment Deleted] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Curtis Wilde updated SPARK-10937: - Comment: was deleted (was: In spark-defaults.conf I've added: spark.sql.hive.metastore.version 0.12.0 spark.sql.hive.metastore.jars /usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar Still getting the same error.) > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > The method 'public String getDefaultExpr()' does not exist in the inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. 
[jira] [Reopened] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Curtis Wilde reopened SPARK-10937: -- In spark-defaults.conf I've added: spark.sql.hive.metastore.version 0.12.0 spark.sql.hive.metastore.jars /usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar Still getting the same error. > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > The method 'public String getDefaultExpr()' does not exist in the inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. 
[jira] [Assigned] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10337: Assignee: Apache Spark (was: Wenchen Fan) > Views are broken > > > Key: SPARK-10337 > URL: https://issues.apache.org/jira/browse/SPARK-10337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > I haven't dug into this yet... but it seems like this should work: > This works: > {code} > SELECT * FROM 100milints > {code} > This seems to work: > {code} > CREATE VIEW testView AS SELECT * FROM 100milints > {code} > This fails: > {code} > SELECT * FROM testView > org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given > input columns id; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collecti
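For contrast, the behavior the report expects: a view defined over a plain {{SELECT *}} should be queryable, with the base table's columns resolving through it. A quick illustration of that expected semantics using sqlite3 (generic SQL, hypothetical table name; not a Spark repro):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ints (id INTEGER)")
conn.executemany("INSERT INTO ints VALUES (?)", [(i,) for i in range(5)])

# The pattern from the report: create a view over SELECT *, then query the
# view. Column resolution passes through the view to the base table.
conn.execute("CREATE VIEW testView AS SELECT * FROM ints")
rows = conn.execute("SELECT * FROM testView ORDER BY id").fetchall()
print(rows)  # [(0,), (1,), (2,), (3,), (4,)]
```

The bug is that Spark's analyzer instead fails to resolve the view's output column against the base relation's schema, producing the AnalysisException above.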
[jira] [Commented] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944259#comment-14944259 ] Apache Spark commented on SPARK-10337: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8990 > Views are broken > > > Key: SPARK-10337 > URL: https://issues.apache.org/jira/browse/SPARK-10337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michael Armbrust >Assignee: Wenchen Fan >Priority: Critical > > I haven't dug into this yet... but it seems like this should work: > This works: > {code} > SELECT * FROM 100milints > {code} > This seems to work: > {code} > CREATE VIEW testView AS SELECT * FROM 100milints > {code} > This fails: > {code} > SELECT * FROM testView > org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given > input columns id; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at 
[jira] [Assigned] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10337: Assignee: Wenchen Fan (was: Apache Spark) > Views are broken > > > Key: SPARK-10337 > URL: https://issues.apache.org/jira/browse/SPARK-10337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michael Armbrust >Assignee: Wenchen Fan >Priority: Critical > > I haven't dug into this yet... but it seems like this should work: > This works: > {code} > SELECT * FROM 100milints > {code} > This seems to work: > {code} > CREATE VIEW testView AS SELECT * FROM 100milints > {code} > This fails: > {code} > SELECT * FROM testView > org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given > input columns id; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at 
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944258#comment-14944258 ] Curtis Wilde commented on SPARK-10937: -- In spark-defaults.conf I've added: spark.sql.hive.metastore.version 0.12.0 spark.sql.hive.metastore.jars /usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar Still getting the same error. > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > The method 'public String getDefaultExpr()' does not exist in the inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. 
[jira] [Assigned] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-9999: --- Assignee: Michael Armbrust > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result its execution is harder to optimize in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. 
Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags, should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
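The requirements above are easier to see in code. The sketch below is hypothetical, modeled on the proposed API in the linked marmbrus/dataset-api pull request; method names such as `as[T]` and `toDF()` are taken from that proposal, the input file `people.json` is a placeholder, and a live `sqlContext` is assumed, so this is an illustration rather than a runnable program.

```scala
case class Person(name: String, age: Int)

// Assumes a live SQLContext; the implicits bring default encoders into scope.
import sqlContext.implicits._

// DataFrame -> Dataset: should fail fast if the JSON schema does not
// line up with the fields of Person.
val people: Dataset[Person] = sqlContext.read.json("people.json").as[Person]

// Typed, compile-time-checked transformations on domain objects,
// still planned and executed by the Spark SQL engine.
val adults: Dataset[Person] = people.filter(_.age >= 18)
val names: Dataset[String]  = adults.map(_.name)

// Seamless transition back to an untyped DataFrame, no conversion boilerplate.
val df: DataFrame = names.toDF()
```

Because `Person` carries the field names, the `toDF()` round trip needs no explicit column mapping, which is the "interoperates with DataFrames" requirement in practice.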
[jira] [Updated] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-: Description: The RDD API is very flexible, and as a result harder to optimize its execution in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to use UDFs, lack of strong types in Scala/Java). The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine. h2. Requirements - *Fast* - In most cases, the performance of Datasets should be equal to or better than working with RDDs. Encoders should be as fast or faster than Kryo and Java serialization, and unnecessary conversion should be avoided. - *Typesafe* - Similar to RDDs, objects and functions that operate on those objects should provide compile-time safety where possible. When converting from data where the schema is not known at compile-time (for example data read from an external source such as JSON), the conversion function should fail-fast if there is a schema mismatch. - *Support for a variety of object models* - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box. - *Java Compatible* - Datasets should provide a single API that works in both Scala and Java. Where possible, shared types like Array will be used in the API. Where not possible, overloaded functions should be provided for both languages. Scala concepts, such as ClassTags should not be required in the user-facing API. - *Interoperates with DataFrames* - Users should be able to seamlessly transition between Datasets and DataFrames, without specifying conversion boiler-plate. 
When names used in the input schema line-up with fields in the given class, no extra mapping should be necessary. Libraries like MLlib should not need to provide different interfaces for accepting DataFrames and Datasets as input. For a detailed outline of the complete proposed API: [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] For an initial discussion of the design considerations in this API: [design doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] was: The RDD API is very flexible, and as a result harder to optimize its execution in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to use UDFs, lack of strong types in Scala/Java). As a Spark user, I want an API that sits somewhere in the middle of the spectrum so I can write most of my applications with that API, and yet it can be optimized well by Spark to achieve performance and stability. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. 
> - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Arr
[jira] [Commented] (SPARK-10938) Remove typeId in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944238#comment-14944238 ] Apache Spark commented on SPARK-10938: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8989 > Remove typeId in columnar cache > --- > > Key: SPARK-10938 > URL: https://issues.apache.org/jira/browse/SPARK-10938 > Project: Spark > Issue Type: Task >Reporter: Davies Liu >Assignee: Davies Liu > > typeId is not needed in the columnar cache; it's confusing to have it.
[jira] [Assigned] (SPARK-10938) Remove typeId in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10938: Assignee: Davies Liu (was: Apache Spark) > Remove typeId in columnar cache > --- > > Key: SPARK-10938 > URL: https://issues.apache.org/jira/browse/SPARK-10938 > Project: Spark > Issue Type: Task >Reporter: Davies Liu >Assignee: Davies Liu > > typeId is not needed in the columnar cache; it's confusing to have it.
[jira] [Assigned] (SPARK-10938) Remove typeId in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10938: Assignee: Apache Spark (was: Davies Liu) > Remove typeId in columnar cache > --- > > Key: SPARK-10938 > URL: https://issues.apache.org/jira/browse/SPARK-10938 > Project: Spark > Issue Type: Task >Reporter: Davies Liu >Assignee: Apache Spark > > typeId is not needed in the columnar cache; it's confusing to have it.
[jira] [Created] (SPARK-10938) Remove typeId in columnar cache
Davies Liu created SPARK-10938: -- Summary: Remove typeId in columnar cache Key: SPARK-10938 URL: https://issues.apache.org/jira/browse/SPARK-10938 Project: Spark Issue Type: Task Reporter: Davies Liu Assignee: Davies Liu typeId is not needed in the columnar cache; it's confusing to have it.
[jira] [Comment Edited] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944218#comment-14944218 ] Curtis Wilde edited comment on SPARK-10937 at 10/5/15 11:13 PM: Yes, I added the following jars to the classpath: guava-11.0.2.jar hadoop-client-2.2.0.2.0.10.0-1.jar hive-common-0.12.0.2.0.10.0-1.jar hive-exec-0.12.0.2.0.10.0-1.jar hive-metastore-0.12.0.2.0.10.0-1.jar hive-serde-0.12.0.2.0.10.0-1.jar (I assumed these were the jars that should be added, because setting spark.sql.hive.metastore.jars=maven caused Spark to look for these jars.) was (Author: crutis): Yes, I added the following jars to the classpath: guava-11.0.2.jar hadoop-client-2.2.0.2.0.10.0-1.jar hive-common-0.12.0.2.0.10.0-1.jar hive-exec-0.12.0.2.0.10.0-1.jar hive-metastore-0.12.0.2.0.10.0-1.jar hive-serde-0.12.0.2.0.10.0-1.jar > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > Method 'public String getDefaultExpr()' is not in inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. 
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apac
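The failure mode in the trace above is the classic one for mixed classpaths: code compiled against Hive 0.14.0's ConfVars calls getDefaultExpr(), and older Hive jars on the classpath throw NoSuchMethodError at runtime. A minimal sketch of how such a mismatch can be detected up front with reflection; this is an illustrative helper (hasMethod is a hypothetical name, not Spark's actual code), demonstrated against java.lang.String since the Hive jars are not assumed to be present:

```scala
// Hypothetical guard: probe for a method before relying on it, so a
// NoSuchMethodError from an older jar can become a clear error message.
object MethodProbe {
  def hasMethod(cls: Class[_], name: String): Boolean =
    // getMethods lists all public methods, including inherited ones.
    cls.getMethods.exists(_.getName == name)

  def main(args: Array[String]): Unit = {
    // String is always on the classpath, unlike the Hive jars.
    println(MethodProbe.hasMethod(classOf[String], "isEmpty"))        // true
    println(MethodProbe.hasMethod(classOf[String], "getDefaultExpr")) // false
  }
}
```

The same probe applied to org.apache.hadoop.hive.conf.HiveConf$ConfVars would distinguish a pre-0.14.0 Hive jar from a newer one before the first call blows up.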
[jira] [Resolved] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10937. Resolution: Invalid Then that's your problem. You're causing the error by overriding the Hive classes shipped with Spark. If you want Spark to use your version-specific Hive jars to access the metastore, take a look at {{spark.sql.hive.metastore.jars}}. > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > Method 'public String getDefaultExpr()' is not in inner class ConfVars of > org.apache.hadoop.hive.conf.Hive.Conf.java before Hive 0.14.0. > org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) > at > org.apach
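The resolution points at {{spark.sql.hive.metastore.jars}}: instead of overriding Spark's classpath with HDP's Hive jars, the version-specific jars should only be used for metastore access. A hypothetical spark-shell invocation (the two property names are real Spark SQL settings; "maven" is one documented value, a classpath of version-specific jars is another):

```shell
# Keep Spark's built-in Hive for execution, but talk to a Hive 0.12
# metastore using version-specific jars resolved from Maven.
spark-shell \
  --conf spark.sql.hive.metastore.version=0.12.0 \
  --conf spark.sql.hive.metastore.jars=maven
```

With this configuration the driver classpath stays untouched, so the execution-side code keeps the Hive version Spark ships with and the NoSuchMethodError above does not arise.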
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944218#comment-14944218 ] Curtis Wilde commented on SPARK-10937: -- Yes, I added the following jars to the classpath: guava-11.0.2.jar hadoop-client-2.2.0.2.0.10.0-1.jar hive-common-0.12.0.2.0.10.0-1.jar hive-exec-0.12.0.2.0.10.0-1.jar hive-metastore-0.12.0.2.0.10.0-1.jar hive-serde-0.12.0.2.0.10.0-1.jar > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > Method 'public String getDefaultExpr()' is not in inner class ConfVars of > org.apache.hadoop.hive.conf.Hive.Conf.java before Hive 0.14.0. 
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:1
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944215#comment-14944215 ] Marcelo Vanzin commented on SPARK-10937: Nevermind. The exception is coming from the execution code, which isn't affected by that config in any case. Are you (or the configuration you're using) overriding Spark's classpath in any way? e.g. placing HDP's Hive jars in the driver's classpath? > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > Method 'public String getDefaultExpr()' is not in inner class ConfVars of > org.apache.hadoop.hive.conf.Hive.Conf.java before Hive 0.14.0. 
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(Spar
[jira] [Comment Edited] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944213#comment-14944213 ] Curtis Wilde edited comment on SPARK-10937 at 10/5/15 11:06 PM: Yes, spark.sql.hive.metastore.version=0.12.0 was (Author: crutis): Yes > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde > > Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and > sqlContext is not properly created. > Method 'public String getDefaultExpr()' is not in inner class ConfVars of > org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. > org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when > 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the > exception below: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at > org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) > at > org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) > at > 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:234) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) >
[jira] [Created] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
Curtis Wilde created SPARK-10937: Summary: java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x Key: SPARK-10937 URL: https://issues.apache.org/jira/browse/SPARK-10937 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 1.5.1 Reporter: Curtis Wilde Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception, and sqlContext is not properly created. Method 'public String getDefaultExpr()' is not in inner class ConfVars of org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the exception below: java.lang.NoSuchMethodError: org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; at org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671) at org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391) at org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235) at org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
org.apache.spark.sql.SQLContext.(SQLContext.scala:234) at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) at $iwC$$iwC.(:9) at $iwC.(:18) at (:20) at .(:24) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$r
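The missing-method failure above is the classic NoSuchMethodError pattern: code compiled against Hive 0.14+ calls an accessor that older HiveConf.ConfVars classes do not have. A minimal sketch of the usual Java guard (not Spark's actual fix; the class and method names below are illustrative) probes for the accessor via reflection instead of linking against it:

```java
import java.lang.reflect.Method;

public class ConfVarsCompat {
    /**
     * Returns confVar.getDefaultExpr() when that public no-arg method exists
     * (Hive 0.14+), and null on older Hive versions where it is absent.
     */
    public static String defaultExprOrNull(Object confVar) {
        try {
            Method m = confVar.getClass().getMethod("getDefaultExpr");
            return (String) m.invoke(confVar);
        } catch (NoSuchMethodException e) {
            // pre-0.14 HiveConf.ConfVars: accessor not present, fall back
            return null;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the lookup happens at runtime rather than at link time, the caller gets a null it can handle instead of a NoSuchMethodError that aborts sqlContext creation.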
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944213#comment-14944213 ] Curtis Wilde commented on SPARK-10937: -- Yes > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde
[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944211#comment-14944211 ] Marcelo Vanzin commented on SPARK-10937: Did you set {{spark.sql.hive.metastore.version}} to {{0.12}}? > java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell > using hive 0.12.x, 0.13.x > -- > > Key: SPARK-10937 > URL: https://issues.apache.org/jira/browse/SPARK-10937 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.1 >Reporter: Curtis Wilde
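For reference, Marcelo's suggestion corresponds to launching the shell with the metastore client version pinned. A hedged example (the jar path is a placeholder; whether this avoids the executionHive code path that triggers the error is exactly what the question is probing):

```shell
# Pin the Hive metastore client to 0.12 instead of the built-in version.
# spark.sql.hive.metastore.jars takes "builtin", "maven", or a JVM classpath.
./bin/spark-shell \
  --conf spark.sql.hive.metastore.version=0.12 \
  --conf spark.sql.hive.metastore.jars=/path/to/hive-0.12/lib/*
```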
[jira] [Updated] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Curtis Wilde updated SPARK-10937: - Description: Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and sqlContext is not properly created. Method 'public String getDefaultExpr()' is not in inner class ConfVars of org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0. org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the exception below: java.lang.NoSuchMethodError: org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String; at org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
[jira] [Commented] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec
[ https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944197#comment-14944197 ] Apache Spark commented on SPARK-8848: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8988 > Write Parquet LISTs and MAPs conforming to Parquet format spec > -- > > Key: SPARK-8848 > URL: https://issues.apache.org/jira/browse/SPARK-8848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > [Parquet format PR #17|https://github.com/apache/parquet-format/pull/17] > standardized structures of Parquet complex types (LIST & MAP). Spark SQL > should follow this spec and write Parquet data conforming to the standard. > Note that although currently Parquet files written by Spark SQL is > non-standard (because Parquet format spec wasn't clear about this part when > Spark SQL Parquet support was authored), it's still compatible with the most > recent Parquet format spec, because the format we use is covered by the > backwards-compatibility rules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
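For context, the LIST structure standardized by the referenced Parquet format PR is the 3-level encoding below (the repeated group must be named {{list}} and its field {{element}} per the spec; the outer field name is illustrative):

```
optional group my_list (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
```

Older 2-level layouts, such as those Spark SQL previously wrote, remain readable under the spec's backwards-compatibility rules, which is why the existing files stay compatible.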
[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166 ] Hans van den Bogert edited comment on SPARK-10474 at 10/5/15 10:39 PM: --- Had this patch against tag 1.5.1: https://gist.github.com/f110f64887f4739b7dd8 Output of the added println()'s are near the beginning in: {noformat} Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties To adjust logging level use sc.setLogLevel("INFO") Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.1 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0) Type in expressions to have them evaluated. Type :help for more information. 15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). numCores:0 1048576 15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set. I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0 I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at master@10.149.3.5:5050 I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting to register without authentication I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 20151006-001500-84120842-5050-6847- 15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context available as sc. 15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException 15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable SQL context available as sqlContext. {noformat} was (Author: hbogert): Had this patch against tag 1.5.1: https://gist.github.com/f110f64887f4739b7dd8.git
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173 ] Alexander Ulanov commented on SPARK-5575: - Weide, These are major features and some of them are under development. You can check their status in the linked issues. Could you work on something smaller as a first step? [~mengxr], do you have any suggestions? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166 ] Hans van den Bogert edited comment on SPARK-10474 at 10/5/15 10:36 PM: --- Had this patch against tag 1.5.1: https://gist.github.com/f110f64887f4739b7dd8.git Output of the added println()'s are near the beginning in: {noformat} Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties To adjust logging level use sc.setLogLevel("INFO") Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.1 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0) Type in expressions to have them evaluated. Type :help for more information. 15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). numCores:0 1048576 15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set. I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0 I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at master@10.149.3.5:5050 I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting to register without authentication I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 20151006-001500-84120842-5050-6847- 15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context available as sc. 15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException 15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable SQL context available as sqlContext. {noformat} was (Author: hbogert): Had this patch against tag 1.5.1: https://gist.github.com/f110f64887f4739b7dd8.git
[jira] [Commented] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944168#comment-14944168 ] Xiangrui Meng commented on SPARK-10384: --- I marked small umbrella JIRAs duplicated in favor of concrete ones. I added a list of statistics in this JIRA description. Hopefully this helps understand the progress better. > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. > Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median (SPARK-6761) > * approximate quantiles (SPARK-6761) > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) > * -number of categories- (This is COUNT DISTINCT in SQL.) 
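The note in SPARK-10384 that variance might depend on mean and count points at the standard trick: a UDAF's update step maintains running central moments in one pass, so variance, stddev, and skewness all fall out of the same buffer. A hedged sketch of that accumulation (a Welford/Chan-style moment update; the class and method names are illustrative, not Spark's UDAF API):

```java
// One-pass accumulation of count, mean, and 2nd/3rd central moments,
// the per-group state a variance/skewness UDAF would keep in its buffer.
public class Moments {
    public long n;
    public double mean;
    public double m2; // sum of (x - mean)^2
    public double m3; // sum of (x - mean)^3

    public void add(double x) {
        n++;
        double delta = x - mean;
        double deltaN = delta / n;
        double term1 = delta * deltaN * (n - 1);
        mean += deltaN;
        // m3 must be updated with the old m2, so order matters here
        m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2;
        m2 += term1;
    }

    public double sampleVariance() {
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    public double skewness() { // population skewness g1
        return Math.sqrt((double) n) * m3 / Math.pow(m2, 1.5);
    }
}
```

Two such buffers can also be merged with closed-form formulas, which is what makes this shape suitable for distributed aggregation; that merge step is what SQL would use to combine partial results per partition.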
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166 ] Hans van den Bogert commented on SPARK-10474: - Had this patch against tag 1.5.1: https://gist.github.com/f110f64887f4739b7dd8.git Output of the added println()'s are near the beginning of the console log. > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at
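The trace above shows UnsafeExternalSorter failing to reserve its initial 65536-byte pointer array at the moment TungstenAggregationIterator falls back to sort-based aggregation. A toy Python model of the failure mode (not Spark's actual ShuffleMemoryManager; all names here are illustrative): a task that has already spent its whole budget on the hash aggregation map gets nothing back for the sorter.

```python
class ToyMemoryPool:
    """Illustrative fixed-size per-task memory budget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_acquire(self, num_bytes):
        """Return the number of bytes actually granted (0 if none are left)."""
        granted = min(num_bytes, self.capacity - self.used)
        self.used += granted
        return granted

    def release(self, num_bytes):
        self.used = max(0, self.used - num_bytes)

pool = ToyMemoryPool(1 << 20)             # 1 MiB task budget
for_hash_map = pool.try_acquire(1 << 20)  # aggregation map fills the budget
for_sorter = pool.try_acquire(64 * 1024)  # pointer array request: nothing left
```

In this toy model `for_sorter` comes back as 0, the analogue of the "Could not acquire 65536 bytes of memory" IOException above.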
[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-10912: - Attachment: s3a_metrics.patch Adding s3a is fairly straightforward. I suspect it is not included because s3a support (via hadoop-aws.jar) is not part of the default Hadoop distribution, due to licensing issues. I created a patch that enables s3a metrics on both the executors and the driver. Reporting shuffle statistics requires more thought, although all the numbers are already collected in TaskMetrics.scala (input, output, shuffle, local, remote, spill, records, bytes, etc.). I think it would make sense to report aggregated metrics per executor across all tasks, so it is easy to get an overall sense of disk I/O and network traffic. > Improve Spark metrics executor.filesystem > - > > Key: SPARK-10912 > URL: https://issues.apache.org/jira/browse/SPARK-10912 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.5.0 >Reporter: Yongjia Wang >Priority: Minor > Attachments: s3a_metrics.patch > > > In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: > "hdfs" and "file". I started using s3 as the persistent storage with Spark > standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. > The 'file' metric appears to be only for driver reading local file, it would > be nice to also report shuffle read/write metrics, so it can help with > optimization. > I think these 2 things (s3 and shuffle) are very useful and cover all the > missing information about Spark IO especially for s3 setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
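The per-executor roll-up proposed in the comment can be sketched as follows (Python with hypothetical names; the real ExecutorSource builds its gauges in Scala from Hadoop filesystem statistics): fold per-task filesystem counters into one read/write total per scheme, so an executor exposes a single metric per filesystem ("hdfs", "file", and with the patch also "s3a").

```python
from collections import defaultdict

def aggregate_by_scheme(per_task_stats):
    """per_task_stats: iterable of (scheme, bytes_read, bytes_written) tuples.

    Returns {scheme: (total_bytes_read, total_bytes_written)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for scheme, bytes_read, bytes_written in per_task_stats:
        totals[scheme][0] += bytes_read
        totals[scheme][1] += bytes_written
    return {scheme: tuple(t) for scheme, t in totals.items()}

# Two tasks touched s3a, one touched hdfs.
stats = [("s3a", 100, 0), ("s3a", 50, 20), ("hdfs", 10, 5)]
```

With these sample numbers, `aggregate_by_scheme(stats)` yields one `(read, written)` pair per scheme, the shape a metrics gauge would report.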
[jira] [Comment Edited] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155 ] Xiangrui Meng edited comment on SPARK-10641 at 10/5/15 10:34 PM: - Any updates? Please submit a PR for code review, and try to think about how to reuse existing implementation of variance. was (Author: mengxr): Any updates? > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
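The Wikipedia algorithm the ticket cites maintains the central moments M2 through M4 in a single pass; since M2 is exactly the running quantity behind variance, an implementation along these lines can share state with the existing variance aggregate, as suggested above. A self-contained Python sketch of the update rules (Spark's own version would be a Scala aggregate):

```python
import math

class Moments:
    """Single-pass higher-order moments: n, mean, M2, M3, M4."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.m3 = 0.0
        self.m4 = 0.0

    def add(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Update M4 and M3 before M2 so they read the previous M2/M3 values.
        self.m4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def skewness(self):
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0  # excess kurtosis

m = Moments()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    m.add(x)
```

For the symmetric input above, the skewness comes out as zero and `m2 / n` is the population variance, which is the state-sharing opportunity the comment points at.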
[jira] [Closed] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-10602. - Resolution: Duplicate Marking this as a duplicate in favor of the concrete JIRAs. Please continue the discussion in those JIRAs. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155 ] Xiangrui Meng commented on SPARK-10641: --- Any updates? > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944153#comment-14944153 ] Xiangrui Meng edited comment on SPARK-10603 at 10/5/15 10:32 PM: - Marking this as duplicated in favor of concrete JIRAs. See SPARK-10384 for the list. was (Author: mengxr): Marking this as duplicated in favor of concrete JIRAs. > Univariate statistics as UDAFs: multi-pass continuous stats > --- > > Key: SPARK-10603 > URL: https://issues.apache.org/jira/browse/SPARK-10603 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring multiple passes over the data, such as median and > quantiles. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-10603. - Resolution: Duplicate Marking this as duplicated in favor of concrete JIRAs. > Univariate statistics as UDAFs: multi-pass continuous stats > --- > > Key: SPARK-10603 > URL: https://issues.apache.org/jira/browse/SPARK-10603 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring multiple passes over the data, such as median and > quantiles. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10604) Univariate statistics as UDAFs: categorical stats
[ https://issues.apache.org/jira/browse/SPARK-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10604. --- Resolution: Duplicate Marking this as duplicated in favor of concrete JIRAs. > Univariate statistics as UDAFs: categorical stats > - > > Key: SPARK-10604 > URL: https://issues.apache.org/jira/browse/SPARK-10604 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > categorical values, such as number of categories or mode. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment
[ https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944150#comment-14944150 ] Matt Cheah commented on SPARK-10877: Does spark-submit enable assertions? I'm not sure how SBT passes these kinds of assertion options along. Also, what JDK / Java Version are you using, and what OS? > Assertions fail straightforward DataFrame job due to word alignment > --- > > Key: SPARK-10877 > URL: https://issues.apache.org/jira/browse/SPARK-10877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Matt Cheah > Attachments: SparkFilterByKeyTest.scala > > > I have some code that I’m running in a unit test suite, but the code I’m > running is failing with an assertion error. > I have translated the JUnit test that was failing, to a Scala script that I > will attach to the ticket. The assertion error is the following: > {code} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: > lengthInBytes must be a multiple of 8 (word-aligned) > at > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247) > at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > 
{code} > However, it turns out that this code actually works normally and computes the > correct result if assertions are turned off. > I traced the code and found that when hashUnsafeWords was called, it was > given a byte-length of 12, which clearly is not a multiple of 8. However, the > job seems to compute correctly regardless of this fact. Of course, I can’t > just disable assertions for my unit test though. > A few things we need to understand: > 1. Why is the lengthInBytes of size 12? > 2. Is it actually a problem that the byte length is not word-aligned? If so, > how should we fix the byte length? If it's not a problem, why is the > assertion flagging a false negative? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
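For reference, hashUnsafeWords consumes the buffer one 8-byte word at a time, which is why the 12-byte length trips the assertion. A minimal sketch of the alignment arithmetic involved (illustrative only, not the fix Spark ultimately applied):

```python
def round_up_to_word(length_in_bytes):
    """Round a byte length up to the next multiple of 8 (one 64-bit word)."""
    return (length_in_bytes + 7) & ~7

def is_word_aligned(length_in_bytes):
    """True when the length satisfies hashUnsafeWords' precondition."""
    return length_in_bytes % 8 == 0
```

Rounding the 12-byte buffer up to 16 bytes and zero-padding would satisfy the word-at-a-time hash, at the cost of hashing the padding; whether that or hashing the tail bytes separately is the right fix is exactly question 2 in the description.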
[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10384: -- Description: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median (SPARK-6761) * approximate quantiles (SPARK-6761) Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) * -number of categories- (This is COUNT DISTINCT in SQL.) was: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. 
It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median (SPARK-6761) * approximate quantiles (SPARK-6761) Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) * number of categories > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. 
> Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median (SPARK-6761) > * approximate quantiles (SPARK-6761) > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) > * -number of categories- (This is COUNT DISTINCT in SQL.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
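The initialize/update/merge/evaluate shape such a UDAF-based statistic would take can be sketched for variance as plain functions (hypothetical names; Spark's actual contract is the UserDefinedAggregateFunction API): per-partition buffers are updated locally and then merged, which is also what would let SQL compute several statistics that share state in one pass.

```python
def var_init():
    return (0, 0.0, 0.0)  # buffer: (n, mean, M2)

def var_update(buf, x):
    """Fold one value into a partition-local buffer (Welford update)."""
    n, mean, m2 = buf
    n += 1
    delta = x - mean
    mean += delta / n
    return (n, mean, m2 + delta * (x - mean))

def var_merge(a, b):
    """Combine two partial buffers (Chan et al. pairwise merge)."""
    (na, ma, m2a), (nb, mb, m2b) = a, b
    n = na + nb
    if n == 0:
        return var_init()
    delta = mb - ma
    mean = ma + delta * nb / n
    return (n, mean, m2a + m2b + delta * delta * na * nb / n)

def var_evaluate(buf):
    n, _, m2 = buf
    return m2 / n  # population variance

# Two "partitions" of [1..5], updated independently and then merged.
left = var_update(var_update(var_init(), 1.0), 2.0)
right = var_update(var_update(var_update(var_init(), 3.0), 4.0), 5.0)
```

Merging `left` and `right` and evaluating yields the same population variance as a single pass over all five values.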
[jira] [Created] (SPARK-10936) UDAF "mode" for categorical variables
Xiangrui Meng created SPARK-10936: - Summary: UDAF "mode" for categorical variables Key: SPARK-10936 URL: https://issues.apache.org/jira/browse/SPARK-10936 Project: Spark Issue Type: Sub-task Reporter: Xiangrui Meng This is similar to frequent items except that we don't have a threshold on the frequency. So an exact implementation might require a global shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
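A sketch of why an exact mode implies a global shuffle: without a frequency threshold, no candidate value can be pruned early, so per-partition counts for every distinct value must be combined before taking the maximum. Plain Python lists stand in for RDD partitions here (illustrative names):

```python
from collections import Counter

def exact_mode(partitions):
    """Return the most frequent value across all partitions."""
    counts = Counter()              # the "shuffle": merge per-partition counts
    for part in partitions:
        counts.update(part)
    value, _ = counts.most_common(1)[0]
    return value

partitions = [[1, 2, 2], [2, 3, 3]]
```

With the sample partitions above the mode is 2, even though no single partition sees 2 as its majority value, which is why a purely partition-local answer is not enough.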
[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10384: -- Description: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median (SPARK-6761) * approximate quantiles (SPARK-6761) Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) * number of categories was: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. 
It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median (SPARK-6761) * approximate quantiles (SPARK-6761) Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) * number of categories > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. 
> Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median (SPARK-6761) > * approximate quantiles (SPARK-6761) > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) > * number of categories -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-10384: - Assignee: Xiangrui Meng (was: Burak Yavuz) > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. > Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median (SPARK-6761) > * approximate quantiles (SPARK-6761) > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) > * number of categories -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10862. --- Resolution: Duplicate > Univariate Statistics: Adding median & quantile support as UDAF > --- > > Key: SPARK-10862 > URL: https://issues.apache.org/jira/browse/SPARK-10862 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10384: -- Description: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median (SPARK-6761) * approximate quantiles (SPARK-6761) Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) * number of categories was: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. 
Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median * approximate quantiles Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) * number of categories > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Burak Yavuz > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. 
> Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median (SPARK-6761) > * approximate quantiles (SPARK-6761) > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) > * number of categories -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10384: -- Description: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance (SPARK-9296) * population variance (SPARK-9296) * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median * approximate quantiles Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) * number of categories was: It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include: continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis categorical: number of categories, mode If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g., {code} df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) {code} Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation. 
Univariate statistics for continuous variables: * -min- * -max- * range (SPARK-10861) * -mean- * sample variance * population variance * -sample standard deviation- (SPARK-6458) * -population standard deviation- (SPARK-6458) * skewness (SPARK-10641) * kurtosis (SPARK-10641) * approximate median * approximate quantiles Univariate statistics for categorical variables: * mode: https://en.wikipedia.org/wiki/Mode_(statistics) * number of categories > Univariate statistics as UDAFs > -- > > Key: SPARK-10384 > URL: https://issues.apache.org/jira/browse/SPARK-10384 > Project: Spark > Issue Type: Umbrella > Components: ML, SQL >Reporter: Xiangrui Meng >Assignee: Burak Yavuz > > It would be nice to define univariate statistics as UDAFs. This JIRA > discusses general implementation and tracks the process of subtasks. > Univariate statistics include: > continuous: min, max, range, variance, stddev, median, quantiles, skewness, > and kurtosis > categorical: number of categories, mode > If we define them as UDAFs, it would be quite flexible to use them with > DataFrames, e.g., > {code} > df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x")) > {code} > Note that some univariate statistics depend on others, e.g., variance might > depend on mean and count. It would be nice if SQL can optimize the sequence > to avoid duplicate computation. 
> Univariate statistics for continuous variables: > * -min- > * -max- > * range (SPARK-10861) > * -mean- > * sample variance (SPARK-9296) > * population variance (SPARK-9296) > * -sample standard deviation- (SPARK-6458) > * -population standard deviation- (SPARK-6458) > * skewness (SPARK-10641) > * kurtosis (SPARK-10641) > * approximate median > * approximate quantiles > Univariate statistics for categorical variables: > * mode: https://en.wikipedia.org/wiki/Mode_(statistics) > * number of categories -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
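The description's note that variance depends on mean and count maps naturally onto a mergeable aggregation buffer, which is the shape a UDAF takes. Below is a minimal pure-Python sketch (not Spark's UDAF API; class and method names are illustrative) of a Welford/Chan-style accumulator whose state covers count, mean, and both variance flavors from the list above:

```python
class VarianceAgg:
    """Mergeable accumulator over (count, mean, M2) -- the same partial
    state a variance UDAF would carry per partition and merge at the end."""

    def __init__(self):
        self.n = 0        # count
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the running mean

    def update(self, x):
        # Welford's single-pass update for one new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Chan et al. pairwise combination of two partial aggregates,
        # e.g. buffers produced by two different partitions.
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        delta = other.mean - self.mean
        n = self.n + other.n
        self.mean += delta * other.n / n
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.n = n
        return self

    def sample_variance(self):
        return self.m2 / (self.n - 1)

    def population_variance(self):
        return self.m2 / self.n
```

Count, mean, and both variances fall out of one buffer, which is exactly the shared-computation opportunity the description asks the SQL optimizer to exploit.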
[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment
[ https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944141#comment-14944141 ] Jason C Lee commented on SPARK-10877: - I enabled assertions by specifying the following in my build.sbt: javaOptions += "-ea", and used sbt package to build. I also ran it with spark-submit instead of spark-shell... I still don't see what you see. $SPARK_HOME/bin/spark-submit --class "SparkFilterByKeyTest" --master local[2] target/scala-2.10/simple-project_2.10-1.0.jar > Assertions fail straightforward DataFrame job due to word alignment > --- > > Key: SPARK-10877 > URL: https://issues.apache.org/jira/browse/SPARK-10877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Matt Cheah > Attachments: SparkFilterByKeyTest.scala > > > I have some code that I’m running in a unit test suite, but the code I’m > running is failing with an assertion error. > I have translated the JUnit test that was failing to a Scala script that I > will attach to the ticket. 
The assertion error is the following: > {code} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: > lengthInBytes must be a multiple of 8 (word-aligned) > at > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247) > at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > {code} > However, it turns out that this code actually works normally and computes the > correct result if assertions are turned off. > I traced the code and found that when hashUnsafeWords was called, it was > given a byte-length of 12, which clearly is not a multiple of 8. However, the > job seems to compute correctly regardless of this fact. Of course, I can’t > just disable assertions for my unit test though. > A few things we need to understand: > 1. Why is the lengthInBytes of size 12? > 2. Is it actually a problem that the byte length is not word-aligned? If so, > how should we fix the byte length? If it's not a problem, why is the > assertion flagging a false negative? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
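For reference, the invariant the failing assertion enforces is that the buffer length is a multiple of the 8-byte word size, since the hash consumes the input word by word. A small plain-Python sketch of the check and of one possible remedy, rounding a length up to the next word boundary (the helper names are illustrative, not Spark's; padding is one candidate fix, not the ticket's conclusion):

```python
WORD_SIZE = 8  # bytes per word, as assumed by a word-wise hash like hashUnsafeWords

def is_word_aligned(length_in_bytes):
    # Mirrors the invariant behind "lengthInBytes must be a multiple of 8".
    return length_in_bytes % WORD_SIZE == 0

def round_up_to_word(length_in_bytes):
    # Smallest multiple of 8 at or above the given length: a 12-byte
    # payload would need 4 bytes of zero padding before word-wise hashing.
    return (length_in_bytes + WORD_SIZE - 1) & ~(WORD_SIZE - 1)
```

This makes the reported failure concrete: the observed length of 12 fails the check, and padding to 16 would satisfy it, though whether padding (versus relaxing the assertion) is correct is exactly question 2 in the issue.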
[jira] [Commented] (SPARK-10868) monotonicallyIncreasingId() supports offset for indexing
[ https://issues.apache.org/jira/browse/SPARK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944137#comment-14944137 ] Martin Senne commented on SPARK-10868: -- I will do so (and give it my best!). Thanks for offering this opportunity. > monotonicallyIncreasingId() supports offset for indexing > > > Key: SPARK-10868 > URL: https://issues.apache.org/jira/browse/SPARK-10868 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: Martin Senne > > With SPARK-7135 and https://github.com/apache/spark/pull/5709 > `monotonicallyIncreasingID()` allows creating an index column with unique > ids. The indexing always starts at 0 (no offset). > *Feature wish* > Having a parameter `offset`, such that the function can be used as > {{monotonicallyIncreasingID( offset )}} > and indexing _starts at *offset* instead of 0_. > *Use-case* > Add rows to a DataFrame that is already written to a DB (via > _.write.jdbc(...)_). > In detail: > - A DataFrame *A* (containing an ID column) with indices from 0 to 199 > in that column already exists in the DB. > - New rows need to be added to *A*. This includes > -- Creating a DataFrame *A'* with new rows, but without an id column > -- Adding the index column to *A'* - this time starting at *200*, as there are > already entries with id's from 0 to 199 (*here, monotonicallyIncreasingID( 200 > ) is required.*) > -- union *A* and *A'* > -- store into DB
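As background for the offset proposal: per SPARK-7135, the generated id packs the partition id into the upper bits and the record number within the partition into the lower 33 bits, so the requested offset could plausibly just be added to the packed value. A pure-Python sketch of that scheme (the function signature and the `offset` parameter are hypothetical illustrations of the feature wish, not Spark's current API):

```python
RECORD_BITS = 33  # lower bits hold the row number within a partition

def monotonically_increasing_id(partition_id, row_in_partition, offset=0):
    """Sketch of the id scheme described in SPARK-7135, plus the
    proposed offset: partition id in the upper bits, per-partition row
    number in the lower 33 bits. Adding `offset` shifts the whole id
    sequence, so indexing can start at e.g. 200 to append after
    existing ids 0..199."""
    return (partition_id << RECORD_BITS) + row_in_partition + offset
```

Note the resulting ids are monotonically increasing but not consecutive across partitions (partition 1 starts at 2^33); the offset only shifts where the sequence begins, which is all the use-case above requires.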
[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs
[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10384:
----------------------------------
    Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation and tracks the progress of subtasks.

Univariate statistics include:
continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL could optimize the sequence to avoid duplicate computation.

Univariate statistics for continuous variables:
* min
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses the general implementation and tracks the progress of subtasks.

Univariate statistics include:
continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL could optimize the sequence to avoid duplicate computation.
> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA
> discusses the general implementation and tracks the progress of subtasks.
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness,
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might
> depend on mean and count. It would be nice if SQL could optimize the
> sequence to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * min
> * max
> * range
> * sample variance
> * population variance
> * sample standard deviation
> * population standard deviation
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories
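The remark above that variance "might depend on mean and count" is the crux of the design: each statistic can be expressed as an aggregator with an `update` (fold in one row) and a `merge` (combine two partitions' partial states) operation, sharing the underlying moments. Below is a minimal pure-Python sketch of that contract for count/mean/sample-variance using the parallel merge formula of Chan et al. It is an illustration of the technique, not Spark's actual UDAF implementation; the class and method names are made up here.

```python
class Moments:
    """Partial-aggregation state for count, mean, and variance.

    Mirrors the update/merge contract a Spark UDAF needs: rows are folded
    in per partition, then partial states are merged across partitions.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's one-pass update for a single row.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Chan et al. pairwise merge of two partial states.
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.mean += delta * other.n / n
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.n = n
        return self

    def sample_variance(self):
        return self.m2 / (self.n - 1)

# Simulate two partitions aggregated independently, then merged --
# count, mean, and variance all come out of the same shared state.
left, right = Moments(), Moments()
for x in [1.0, 2.0, 3.0]:
    left.update(x)
for x in [4.0, 5.0]:
    right.update(x)
combined = left.merge(right)
```

Skewness and kurtosis extend the same state with third and fourth central moments, which is why computing them together avoids the duplicate work the description mentions.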
[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions
[ https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944128#comment-14944128 ]

Xiangrui Meng commented on SPARK-9941:
--------------------------------------

Created https://issues.apache.org/jira/browse/SPARK-10935. I think we could start by importing the datasets using spark-csv.

> Try ML pipeline API on Kaggle competitions
> ------------------------------------------
>
>                 Key: SPARK-9941
>                 URL: https://issues.apache.org/jira/browse/SPARK-9941
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> This is an umbrella JIRA to track some fun tasks :)
> We have built many features under the ML pipeline API, and we want to see
> how it works on real-world datasets, e.g., Kaggle competition datasets
> (https://www.kaggle.com/competitions). We want to invite community members
> to help test. The goal is NOT to win the competitions but to provide code
> examples and to find out missing features and other issues to help shape
> the roadmap.
> For people who are interested, please do the following:
> 1. Create a subtask (or leave a comment if you cannot create a subtask) to
> claim a Kaggle dataset.
> 2. Use the ML pipeline API to build and tune an ML pipeline that works for
> the Kaggle dataset.
> 3. Paste the code to gist (https://gist.github.com/) and provide the link
> here.
> 4. Report missing features, issues, running times, and accuracy.