[jira] [Updated] (SPARK-10946) JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs

2015-10-05 Thread Pallavi Priyadarshini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pallavi Priyadarshini updated SPARK-10946:
--
Summary: JDBC - Use Statement.executeUpdate instead of 
PreparedStatement.executeUpdate for DDLs  (was: JDBC - Use Statement.execute 
instead of PreparedStatement.execute for DDLs)

> JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate 
> for DDLs
> --
>
> Key: SPARK-10946
> URL: https://issues.apache.org/jira/browse/SPARK-10946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.1
>Reporter: Pallavi Priyadarshini
>Priority: Minor
>
> Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under 
> the covers. The current code in DataFrameWriter and JDBCUtils uses 
> PreparedStatement.executeUpdate to issue the DDLs to the databases. This causes 
> the DDLs to fail against a couple of databases that do not support preparing 
> DDL statements.
> Can we use Statement.executeUpdate instead of 
> PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there 
> shouldn't be a performance impact.
> I can submit a PULL request if no one has objections.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10946) JDBC - Use Statement.execute instead of PreparedStatement.execute for DDLs

2015-10-05 Thread Pallavi Priyadarshini (JIRA)
Pallavi Priyadarshini created SPARK-10946:
-

 Summary: JDBC - Use Statement.execute instead of 
PreparedStatement.execute for DDLs
 Key: SPARK-10946
 URL: https://issues.apache.org/jira/browse/SPARK-10946
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.4.1, 1.4.0
Reporter: Pallavi Priyadarshini
Priority: Minor


Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under 
the covers. The current code in DataFrameWriter and JDBCUtils uses 
PreparedStatement.executeUpdate to issue the DDLs to the databases. This causes 
the DDLs to fail against a couple of databases that do not support preparing 
DDL statements.

Can we use Statement.executeUpdate instead of PreparedStatement.executeUpdate? 
DDL is not a repetitive activity, so there shouldn't be a performance impact.
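For illustration, a minimal sketch of the proposed change against a plain 
java.sql.Connection (the helper below is hypothetical and only meant to show the 
Statement vs. PreparedStatement difference, not the actual DataFrameWriter/JDBCUtils 
code):

{code}
import java.sql.Connection

// Hypothetical helper for issuing a single DDL statement.
def runDdl(conn: Connection, ddl: String): Unit = {
  // Current approach: some databases reject preparing DDL.
  //   conn.prepareStatement(ddl).executeUpdate()
  // Proposed approach: issue the DDL through a plain Statement.
  val stmt = conn.createStatement()
  try {
    stmt.executeUpdate(ddl)
  } finally {
    stmt.close()
  }
}
{code}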

I can submit a PULL request if no one has objections.

Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944569#comment-14944569
 ] 

Rekha Joshi commented on SPARK-10942:
-

Great, thanks [~pnpritchard]. In any case, I will keep an eye open in case I see 
this happening under specific conditions. Thanks!

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>Priority: Minor
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2015-10-05 Thread Khaled Ammar (JIRA)
Khaled Ammar created SPARK-10945:


 Summary: GraphX computes Pagerank with NaN (with some datasets)
 Key: SPARK-10945
 URL: https://issues.apache.org/jira/browse/SPARK-10945
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
 Environment: Linux
Reporter: Khaled Ammar


Hi,

I run GraphX in a medium-size standalone Spark 1.3.0 installation. PageRank 
typically works fine, except with one dataset (Twitter: 
http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that is 
commonly used in research papers.

I found that many vertices have NaN values. This is true even if the algorithm 
runs for only 1 iteration.
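A minimal way to observe this, assuming the dataset is available as an edge-list 
file (the path below is illustrative and {{sc}} is the SparkContext of the 
standalone installation):

{code}
import org.apache.spark.graphx.GraphLoader

// Hypothetical location of an edge-list export of the twitter-2010 dataset.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/twitter-2010/edgelist.txt")

// Even a single static PageRank iteration shows NaN ranks on this dataset.
val ranks = graph.staticPageRank(1).vertices
val nanVertices = ranks.filter { case (_, rank) => rank.isNaN }.count()
println(s"Vertices with NaN PageRank: $nanVertices")
{code}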

Thanks,
-Khaled



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar

2015-10-05 Thread Pranas Baliuka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539
 ] 

Pranas Baliuka edited comment on SPARK-10944 at 10/6/15 5:42 AM:
-

If one wants to deploy Spark without Hadoop, it should be possible. Currently 
even the path names and jar names conflict with each other:

{quote}
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
{quote}

Long-term solution: remove mentions of Hadoop from the paths and jar names.


was (Author: pranas):
If one wants to deploy Spark without Hadoop, it should be possible. Currently 
even the path names and jar names conflict with each other:

{quote}
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
{quote}

> org/slf4j/Logger is not provided in 
> spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
> ---
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Blocker
>  Labels: easyfix, patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An attempt to run a Spark cluster on a Mac OS machine fails.
> Invocation:
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short-term fix:
> Bundle all required third-party libs into the uberjar and/or fix the start-up 
> script to include the required third-party libs.
> Long-term quality improvement proposal: introduce integration tests that check 
> the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar

2015-10-05 Thread Pranas Baliuka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539
 ] 

Pranas Baliuka edited comment on SPARK-10944 at 10/6/15 5:42 AM:
-

If one wants to deploy Spark without Hadoop, it should be possible. Currently 
even the path names and jar names contradict each other:

{quote}
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
{quote}

Long-term solution: remove mentions of Hadoop from the paths and jar names.


was (Author: pranas):
If one wants to deploy Spark without Hadoop, it should be possible. Currently 
even the path names and jar names conflict with each other:

{quote}
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
{quote}

Long-term solution: remove mentions of Hadoop from the paths and jar names.

> org/slf4j/Logger is not provided in 
> spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
> ---
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Blocker
>  Labels: easyfix, patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An attempt to run a Spark cluster on a Mac OS machine fails.
> Invocation:
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short-term fix:
> Bundle all required third-party libs into the uberjar and/or fix the start-up 
> script to include the required third-party libs.
> Long-term quality improvement proposal: introduce integration tests that check 
> the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar

2015-10-05 Thread Pranas Baliuka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944539#comment-14944539
 ] 

Pranas Baliuka commented on SPARK-10944:


If one wants to deploy Spark without Hadoop, it should be possible. Currently 
even the path names and jar names conflict with each other:

{quote}
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
{quote}

> org/slf4j/Logger is not provided in 
> spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
> ---
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Blocker
>  Labels: easyfix, patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> An attempt to run a Spark cluster on a Mac OS machine fails.
> Invocation:
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short-term fix:
> Bundle all required third-party libs into the uberjar and/or fix the start-up 
> script to include the required third-party libs.
> Long-term quality improvement proposal: introduce integration tests that check 
> the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10944) org/slf4j/Logger is not provided in spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar

2015-10-05 Thread Pranas Baliuka (JIRA)
Pranas Baliuka created SPARK-10944:
--

 Summary: org/slf4j/Logger is not provided in 
spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
 Key: SPARK-10944
 URL: https://issues.apache.org/jira/browse/SPARK-10944
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.5.1
 Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
Reporter: Pranas Baliuka
Priority: Blocker


An attempt to run a Spark cluster on a Mac OS machine fails.

Invocation:
{code}
# cd $SPARK_HOME
Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
{code}

Output:
{code}
starting org.apache.spark.deploy.master.Master, logging to 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
failed to launch org.apache.spark.deploy.master.Master:
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
full log in 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
{code}

Log:
{code}
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
Spark Command: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
 -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
7077 --webui-port 8080

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at 
sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
{code}

Proposed short-term fix:
Bundle all required third-party libs into the uberjar and/or fix the start-up 
script to include the required third-party libs.

Long-term quality improvement proposal: introduce integration tests that check 
the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-05 Thread Jason Pohl (JIRA)
Jason Pohl created SPARK-10943:
--

 Summary: NullType Column cannot be written to Parquet
 Key: SPARK-10943
 URL: https://issues.apache.org/jira/browse/SPARK-10943
 Project: Spark
  Issue Type: Bug
Reporter: Jason Pohl


var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
as comments")

//FAIL - Try writing a NullType column (where all the values are NULL)
data02.write.parquet("/tmp/celtra-test/dataset2")

at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 179.0 (TID 39924, 10.0.196.208): org.apache.spark.sql.AnalysisException: 
Unsupported data type StructField(comments,NullType,true).dataType;
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
at 
org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRel
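A possible workaround (my suggestion, not part of the report): cast the all-NULL 
column to a concrete type before writing, so the Parquet schema converter sees a 
supported type instead of NullType:

{code}
import org.apache.spark.sql.types.StringType

// Workaround sketch: give the all-NULL column a concrete type before writing.
val data03 = data02.select(
  data02("id"),
  data02("text"),
  data02("comments").cast(StringType).as("comments"))
data03.write.parquet("/tmp/celtra-test/dataset2")
{code}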

[jira] [Updated] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Nick Pritchard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pritchard updated SPARK-10942:
---
Priority: Minor  (was: Major)

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>Priority: Minor
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Nick Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944508#comment-14944508
 ] 

Nick Pritchard commented on SPARK-10942:


[~rekhajoshm] Thanks for trying to reproduce it. Since you do not this see the 
same, this is most likely an issue on my end so I'll downgrade the priority. I 
am using 1.5.0 so will try 1.6.0-snapshot and also investigate the logs.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10942:

Attachment: SPARK-10942_3.png
SPARK-10942_2.png
SPARK-10942_1.png

SPARK-10942: Ran a TestStreaming job to check the cache and storage scenario. So 
far, for my runs, the storage gets cleared out.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944500#comment-14944500
 ] 

Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:50 AM:
--

SPARK-10942: Attached job run screenshots for the TestStreaming job run to check 
the cache and storage scenario. So far, for my runs, the storage gets cleared 
out.


was (Author: rekhajoshm):
SPARK-10942: Ran a TestStreaming job to check the cache and storage scenario. So 
far, for my runs, the storage gets cleared out.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944497#comment-14944497
 ] 

Rekha Joshi commented on SPARK-10942:
-

Thanks [~pnpritchard], I have tried to replicate the issue a few times now. So 
far I see the Storage tab getting cleaned out, and I do not even specify a TTL.
Attached job run screenshots. I am on 1.6.0-snapshot and do not currently have 
any other load on the system, but the first-level diagnosis is that automatic 
unpersist does happen. I also see the logs below stating that the persistence 
list is getting updated in the background and storage is cleared. [~sowen] 
[~vanzin] Your thoughts? Thanks
{panel}
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches 
ArrayBuffer()
15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: 
{panel}


> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944497#comment-14944497
 ] 

Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:48 AM:
--

Thanks [~pnpritchard], I have tried to replicate the issue a few times now. So 
far I see the Storage tab getting cleaned out, and I do not even specify a TTL. 
Attached job run screenshots. I am on 1.6.0-snapshot and do not currently have 
any other load on the system, but the first-level diagnosis is that automatic 
unpersist does happen. I also see the logs below stating that the persistence 
list is getting updated in the background and storage is cleared. [~sowen] 
[~vanzin] Your thoughts? Thanks
{panel}
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches 
ArrayBuffer()
15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: 
{panel}



was (Author: rekhajoshm):
Thanks [~pnpritchard], I have tried to replicate the issue a few times now. So 
far I see the Storage tab getting cleaned out, and I do not even specify a TTL.
Attached job run screenshots. I am on 1.6.0-snapshot and do not currently have 
any other load on the system, but the first-level diagnosis is that automatic 
unpersist does happen. I also see the logs below stating that the persistence 
list is getting updated in the background and storage is cleared. [~sowen] 
[~vanzin] Your thoughts? Thanks
{panel}
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO rdd.ParallelCollectionRDD: Removing RDD 30 from 
persistence list
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO storage.BlockManager: Removing RDD 30
15/10/05 21:42:24 INFO scheduler.ReceivedBlockTracker: Deleting batches 
ArrayBuffer()
15/10/05 21:42:24 INFO scheduler.InputInfoTracker: remove old batch metadata: 
{panel}


> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944485#comment-14944485
 ] 

Seth Hendrickson commented on SPARK-10641:
--

My apologies, I haven't been able to devote much time to this lately. To your 
point, one of the bigger decisions for this PR will be how to combine these 
functions with other aggregates, since online algorithms for higher-order 
statistical moments require the calculation of all the lower-order moments. I 
can have a WIP PR up by tomorrow, so we can get some discussion going. This PR 
will also be affected by several other ongoing PRs.
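To make that dependency concrete, here is a rough sketch (my own, not the 
planned Spark aggregate) of the single-pass moment updates from the Wikipedia 
reference in this issue; skewness and kurtosis both fall out of the lower-order 
moments, which is what a combined aggregate has to carry along:

{code}
// Online update of the central moments M2..M4, following
// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
// Names are illustrative only.
case class Moments(n: Long, mean: Double, m2: Double, m3: Double, m4: Double) {
  def add(x: Double): Moments = {
    val n1 = n
    val nNew = n + 1
    val delta = x - mean
    val deltaN = delta / nNew
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    val newM4 = m4 + term1 * deltaN2 * (nNew * nNew - 3 * nNew + 3) +
      6 * deltaN2 * m2 - 4 * deltaN * m3
    val newM3 = m3 + term1 * deltaN * (nNew - 2) - 3 * deltaN * m2
    Moments(nNew, mean + deltaN, m2 + term1, newM3, newM4)
  }
  // The higher-order statistics are derived from the lower-order moments.
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0
}

val stats = Seq(1.0, 2.0, 2.0, 3.0, 7.0)
  .foldLeft(Moments(0L, 0.0, 0.0, 0.0, 0.0)) { (acc, x) => acc.add(x) }
println(s"skewness = ${stats.skewness}, kurtosis = ${stats.kurtosis}")
{code}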

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10382) Make example code in user guide testable

2015-10-05 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944480#comment-14944480
 ] 

Xusen Yin commented on SPARK-10382:
---

[~mengxr] I'd love to work on this if no one else is already working on it.

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.
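For concreteness, one possible shape of the marker comments in the referenced 
example file (purely a sketch; the actual marker syntax is exactly what this 
JIRA is meant to decide):

{code}
// examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
import org.apache.spark.ml.clustering.KMeans

// $example on:guide$   <- hypothetical begin marker for the "guide" block
val kmeans = new KMeans().setK(2).setSeed(1L)
println(s"Configured k-means with k = ${kmeans.getK}")
// $example off:guide$  <- hypothetical end marker
{code}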



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Nick Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944478#comment-14944478
 ] 

Nick Pritchard commented on SPARK-10942:


Regardless, the documentation for {{spark.streaming.unpersist}} and 
{{spark.cleaner.ttl}} suggests that unpersisting will be handled automatically 
by Spark code.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Nick Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944477#comment-14944477
 ] 

Nick Pritchard commented on SPARK-10942:


[~rekhajoshm] Yes, but calling {{rdd2.unpersist()}} negates the call to 
{{rdd2.cache()}}, no matter where I put it in the {{transform}} closure. This 
is because all the operations on {{rdd2}} are lazy.
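To illustrate the point (a sketch based on the example in the description, not 
new code): inside {{transform}} the unpersist call runs while the batch's DAG is 
being constructed, before any job has materialized the cached RDD:

{code}
val output = input.transform { rdd =>
  val rdd2 = rdd.map(identity)
  rdd2.cache()        // only marks rdd2 for caching; nothing is computed yet
  val rdd3 = rdd2.map(identity)
  rdd2.unpersist()    // executes immediately, while the DAG is being built,
                      // so it clears the cache marker before rdd3 (and hence
                      // rdd2) is ever computed
  rdd3
}
{code}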

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944465#comment-14944465
 ] 

Rekha Joshi edited comment on SPARK-10942 at 10/6/15 4:04 AM:
--

[~pnpritchard] Hi, did you try rdd2.unpersist()?


was (Author: rekhajoshm):
[~pnpritchard] Hi, did you try rdd.unpersist()?

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944465#comment-14944465
 ] 

Rekha Joshi commented on SPARK-10942:
-

[~pnpritchard] Hi, did you try rdd.unpersist()?

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to work around it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement

2015-10-05 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944462#comment-14944462
 ] 

Dilip Biswal commented on SPARK-10534:
--

I would like to work on this.

> ORDER BY clause allows only columns that are present in SELECT statement
> 
>
> Key: SPARK-10534
> URL: https://issues.apache.org/jira/browse/SPARK-10534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michal Cwienczek
>
> When invoking the query SELECT EmployeeID from Employees order by YEAR(HireDate), 
> Spark 1.5 throws an exception:
> {code}
> cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given 
> input columns EmployeeID; line 2 pos 14 StackTrace: 
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns 
> EmployeeID; line 2 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$c
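One possible workaround (my own sketch, not from the report): include the 
ordering expression in the SELECT under an alias and drop the helper column 
afterwards:

{code}
// "Employees" is assumed to be a registered table with EmployeeID and HireDate
// columns; the alias "hireYear" is only for illustration.
val sorted = sqlContext.sql(
  "SELECT EmployeeID, YEAR(HireDate) AS hireYear FROM Employees ORDER BY hireYear")
val result = sorted.drop("hireYear")
result.show()
{code}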

[jira] [Created] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-05 Thread Nick Pritchard (JIRA)
Nick Pritchard created SPARK-10942:
--

 Summary: Not all cached RDDs are unpersisted
 Key: SPARK-10942
 URL: https://issues.apache.org/jira/browse/SPARK-10942
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Nick Pritchard


I have a Spark Streaming application that caches RDDs inside of a {{transform}} 
closure. Looking at the Spark UI, it seems that most of these RDDs are 
unpersisted after the batch completes, but not all.

I have copied a minimal reproducible example below to highlight the problem. I 
run this and monitor the Spark UI "Storage" tab. The example generates and 
caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
remain cached. There is some randomness going on because I see different RDDs 
remain cached for each run.

I have marked this as Major because I haven't been able to work around it and it 
is a memory leak for my application. I tried setting {{spark.cleaner.ttl}}, but 
that did not change anything.

{code}
import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// Assumes an existing SparkContext (sc) and StreamingContext (ssc).
val inputRDDs = mutable.Queue.tabulate(30) { i =>
  sc.parallelize(Seq(i))
}
val input: DStream[Int] = ssc.queueStream(inputRDDs)

val output = input.transform { rdd =>
  if (rdd.isEmpty()) {
    rdd
  } else {
    // Cache a derived RDD inside the transform closure; these are the entries
    // that should all be unpersisted after the batch completes.
    val rdd2 = rdd.map(identity)
    rdd2.setName(rdd.first().toString)
    rdd2.cache()
    val rdd3 = rdd2.map(identity)
    rdd3
  }
}
output.print()

ssc.start()
ssc.awaitTermination()
{code}
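
As a possible stop-gap (not a fix for the underlying cleaner behaviour), here is a 
workaround sketch: track the RDDs cached inside the closure on the driver and 
unpersist them explicitly a couple of batches later. It reuses the {{input}} 
DStream from the example above; the two-batch retention window is an arbitrary 
choice for illustration.

{code}
// Workaround sketch only: keep handles to the cached RDDs and unpersist them
// manually instead of relying on automatic cleanup.
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val recentlyCached = mutable.Queue.empty[RDD[Int]]

val output2 = input.transform { rdd =>
  // transform's function runs on the driver once per batch, so it is safe to
  // mutate this driver-side queue here.
  while (recentlyCached.size > 2) {
    recentlyCached.dequeue().unpersist(blocking = false)
  }
  if (rdd.isEmpty()) {
    rdd
  } else {
    val rdd2 = rdd.map(identity)
    rdd2.cache()
    recentlyCached.enqueue(rdd2)
    rdd2.map(identity)
  }
}
output2.print()
{code}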



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-05 Thread Jayant Shekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944415#comment-14944415
 ] 

Jayant Shekhar commented on SPARK-6723:
---

[~fliang] Hey Feyman, I made changes to the PR based on your inputs and fixed 
the merge conflicts. You can try it out.
Thanks.


> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-05 Thread Jayant Shekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944415#comment-14944415
 ] 

Jayant Shekhar edited comment on SPARK-6723 at 10/6/15 2:25 AM:


[~fliang] Hey Feynman, I made changes to the PR based on your inputs and fixed 
the merge conflicts. You can try it out.
Thanks.



was (Author: jayants):
[~fliang] Hey Feyman, I made changes to the PR based on your inputs and fixed 
the merge conflicts. You can try it out.
Thanks.


> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10900) Add output operation events to StreamingListener

2015-10-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10900.
---
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0

> Add output operation events to StreamingListener
> 
>
> Key: SPARK-10900
> URL: https://issues.apache.org/jira/browse/SPARK-10900
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944411#comment-14944411
 ] 

Apache Spark commented on SPARK-10941:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8973

> Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve 
> code clarity
> --
>
> Key: SPARK-10941
> URL: https://issues.apache.org/jira/browse/SPARK-10941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark SQL's new AlgebraicAggregate interface is confusingly named.
> AlgebraicAggregate inherits from AggregateFunction2, adds a new set of 
> methods, then effectively bans the use of the inherited methods. This is 
> really confusing. I think that it's an anti-pattern / bad code smell if you 
> end up inheriting and wanting to remove methods inherited from the superclass.
> I think that we should re-name this class and should refactor the class 
> hierarchy so that there's a clear distinction between which parts of the code 
> work with imperative aggregate functions vs. expression-based aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10941:


Assignee: Apache Spark  (was: Josh Rosen)

> Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve 
> code clarity
> --
>
> Key: SPARK-10941
> URL: https://issues.apache.org/jira/browse/SPARK-10941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark SQL's new AlgebraicAggregate interface is confusingly named.
> AlgebraicAggregate inherits from AggregateFunction2, adds a new set of 
> methods, then effectively bans the use of the inherited methods. This is 
> really confusing. I think that it's an anti-pattern / bad code smell if you 
> end up inheriting and wanting to remove methods inherited from the superclass.
> I think that we should re-name this class and should refactor the class 
> hierarchy so that there's a clear distinction between which parts of the code 
> work with imperative aggregate functions vs. expression-based aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10941:


Assignee: Josh Rosen  (was: Apache Spark)

> Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve 
> code clarity
> --
>
> Key: SPARK-10941
> URL: https://issues.apache.org/jira/browse/SPARK-10941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark SQL's new AlgebraicAggregate interface is confusingly named.
> AlgebraicAggregate inherits from AggregateFunction2, adds a new set of 
> methods, then effectively bans the use of the inherited methods. This is 
> really confusing. I think that it's an anti-pattern / bad code smell if you 
> end up inheriting and wanting to remove methods inherited from the superclass.
> I think that we should re-name this class and should refactor the class 
> hierarchy so that there's a clear distinction between which parts of the code 
> work with imperative aggregate functions vs. expression-based aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-05 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-10941:
--

 Summary: Refactor AggregateFunction2 and AlgebraicAggregate 
interfaces to improve code clarity
 Key: SPARK-10941
 URL: https://issues.apache.org/jira/browse/SPARK-10941
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark SQL's new AlgebraicAggregate interface is confusingly named.

AlgebraicAggregate inherits from AggregateFunction2, adds a new set of methods, 
then effectively bans the use of the inherited methods. This is really 
confusing. I think that it's an anti-pattern / bad code smell if you end up 
inheriting and wanting to remove methods inherited from the superclass.

I think that we should re-name this class and should refactor the class 
hierarchy so that there's a clear distinction between which parts of the code 
work with imperative aggregate functions vs. expression-based aggregates.
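
For illustration only (the names below are hypothetical, not Spark's actual 
classes): one shape the split could take is a small common parent with two 
disjoint sub-traits, so neither style inherits methods it is forbidden to use. 
Simplified stand-in types are used in place of Catalyst's Expression/DataType so 
the sketch is self-contained.

{code}
// Hypothetical sketch of the proposed split; not Spark source code.
sealed trait AggregateFunction2Like {
  def name: String
}

// Aggregates written imperatively against a mutable aggregation buffer.
trait ImperativeAggregateLike extends AggregateFunction2Like {
  def initialize(buffer: Array[Any]): Unit
  def update(buffer: Array[Any], input: Array[Any]): Unit
  def merge(buffer: Array[Any], otherBuffer: Array[Any]): Unit
  def eval(buffer: Array[Any]): Any
}

// Aggregates expressed declaratively as expressions the planner can rewrite;
// they never touch a mutable buffer directly.
trait DeclarativeAggregateLike extends AggregateFunction2Like {
  def initialValues: Seq[String]      // placeholder for initial-value expressions
  def updateExpressions: Seq[String]  // placeholder for update expressions
  def mergeExpressions: Seq[String]   // placeholder for merge expressions
  def evaluateExpression: String      // placeholder for the final result expression
}
{code}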



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-10-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944405#comment-14944405
 ] 

Josh Rosen commented on SPARK-9239:
---

Is this duplicated by SPARK-10765?

> HiveUDAF support for AggregateFunction2
> ---
>
> Key: SPARK-9239
> URL: https://issues.apache.org/jira/browse/SPARK-9239
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to Scala one

2015-10-05 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944305#comment-14944305
 ] 

Bryan Cutler edited comment on SPARK-10560 at 10/6/15 1:17 AM:
---

Hi [~yanboliang], I just want to make sure I'm on the same page as to what we 
need to do here.  
Here are the differences I see between the Python and Scala APIs for 
StreamingLogisticRegressionWithSGD:
-
* The documentation for Python is missing the default parameter values, also 
the same for StreamingLinearRegressionWithSGD

* In Python StreamingLogisticRegressionWithSGD the regularization defaults to 
0.01 while the Scala version defaults to 0.  
I believe other SGD implementations default to non-zero, so maybe there is some 
reason to turn it off in Streaming implementations?
In any case, these ones should probably default to the same value

* The Scala StreamingLogisticRegressionWithSGD is missing a method to set 
convergence tolerance, it is in the Python one

* StreamingLogisticRegressionWithSGD for Scala and Python are missing ability 
to set regularization parameter

* Python Streaming**RegressionWithSGD are missing API methods to set 
parameters, i.e. setStepSize

-
How about for this JIRA, I fix the documentation to include default parameters 
and then I will make JIRAs for the other items?



was (Author: bryanc):
Hi [~yanboliang]], I just want to make sure I'm on the same page as to what we 
need to do here.  
Here are the differences I see between the Python and Scala APIs for 
StreamingLogisticRegressionWithSGD:
-
* The documentation for Python is missing the default parameter values, also 
the same for StreamingLinearRegressionWithSGD

* In Python StreamingLogisticRegressionWithSGD the regularization defaults to 
0.01 while the Scala version defaults to 0.  
I believe other SGD implementations default to non-zero, so maybe there is some 
reason to turn it off in Streaming implementations?
In any case, these ones should probably default to the same value

* The Scala StreamingLogisticRegressionWithSGD is missing a method to set 
convergence tolerance, it is in the Python one

* StreamingLogisticRegressionWithSGD for Scala and Python are missing ability 
to set regularization parameter

* Python Streaming**RegressionWithSGD are missing API methods to set 
parameters, i.e. setStepSize

How about for this JIRA, I fix the documentation to include default parameters 
and then I will make JIRAs for the other items?


> Make StreamingLogisticRegressionWithSGD Python API equal to Scala one
> 
>
> Key: SPARK-10560
> URL: https://issues.apache.org/jira/browse/SPARK-10560
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> The StreamingLogisticRegressionWithSGD Python API lacks some parameters 
> compared with the Scala one; here we make them equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-10-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944340#comment-14944340
 ] 

Michael Armbrust commented on SPARK-9776:
-

You should not create a HiveContext in the spark-shell.  One is already created 
for you as sqlContext.
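
For reference, a minimal session that sticks to the provided context (this 
assumes Spark was built with Hive support, so the shell's {{sqlContext}} is a 
HiveContext):

{code}
// Reuse the sqlContext the shell already created; instantiating a second
// HiveContext makes it try to boot the same embedded Derby metastore and fail
// with XSDB6.
sqlContext.sql("SHOW TABLES").show()
sqlContext.tables().show()
{code}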

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10585) only copy data once when generating unsafe projection

2015-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944337#comment-14944337
 ] 

Apache Spark commented on SPARK-10585:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8991

> only copy data once when generating unsafe projection
> ---
>
> Key: SPARK-10585
> URL: https://issues.apache.org/jira/browse/SPARK-10585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>
> When we have a nested struct, array, or map, we create a byte buffer for 
> each of them, copy the data into that buffer first, and then copy it into the 
> final row buffer. We can skip the first copy and write the data directly into 
> the final row buffer.
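
A minimal, self-contained sketch of the difference being described, using plain 
{{java.nio.ByteBuffer}} stand-ins rather than Spark's actual Tungsten writers:

{code}
import java.nio.ByteBuffer

// Today: nested values go through a scratch buffer before reaching the row buffer.
def writeWithScratchBuffer(values: Array[Int], rowBuffer: ByteBuffer): Unit = {
  val scratch = ByteBuffer.allocate(values.length * 4)
  values.foreach(v => scratch.putInt(v)) // first copy: value -> scratch buffer
  scratch.flip()
  rowBuffer.put(scratch)                 // second copy: scratch buffer -> row buffer
}

// Proposed: write the nested values straight into the final row buffer.
def writeDirectly(values: Array[Int], rowBuffer: ByteBuffer): Unit = {
  values.foreach(v => rowBuffer.putInt(v)) // single copy: value -> row buffer
}
{code}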



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-10-05 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944335#comment-14944335
 ] 

Alexander Pivovarov commented on SPARK-9776:


To reproduce the issue 
1. start emr-4.1.0 cluster (it comes with spark-1.5.0 and yarn)
2. ssh to master box
3. open spark-shell
4. run new org.apache.spark.sql.hive.HiveContext(sc)

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10934) hashCode of unsafe array may crash

2015-10-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10934:
---
Assignee: Wenchen Fan

> hashCode of unsafe array may crash
> --
>
> Key: SPARK-10934
> URL: https://issues.apache.org/jira/browse/SPARK-10934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.5.2, 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10934) hashCode of unsafe array may crash

2015-10-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10934:
---
Fix Version/s: 1.5.2

> hashCode of unsafe array may crash
> --
>
> Key: SPARK-10934
> URL: https://issues.apache.org/jira/browse/SPARK-10934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.5.2, 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10934) hashCode of unsafe array may crash

2015-10-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10934.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8987
[https://github.com/apache/spark/pull/8987]

> hashCode of unsafe array may crash
> --
>
> Key: SPARK-10934
> URL: https://issues.apache.org/jira/browse/SPARK-10934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10940) Too many open files Spark Shuffle

2015-10-05 Thread Sandeep Pal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944307#comment-14944307
 ] 

Sandeep Pal commented on SPARK-10940:
-

I have already read the similar issue below, but it does not help: 
https://issues.apache.org/jira/browse/SPARK-9921

> Too many open files Spark Shuffle
> -
>
> Key: SPARK-10940
> URL: https://issues.apache.org/jira/browse/SPARK-10940
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node standalone spark cluster with 1 master and 5 
> worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 
> 36 cores.
>Reporter: Sandeep Pal
>
> Executing terasort by Spark-SQL on the data generated by teragen in hadoop. 
> Data size generated is ~456 GB. 
> Terasort passes with --total-executor-cores = 40, whereas it fails for 
> --total-executor-cores = 120. 
> I have tried to increase the ulimit to 10k but the problem persists.
> Below is the error message from one of the executor node:
> java.io.FileNotFoundException: 
> /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
>  (Too many open files)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10560) Make StreamingLogisticRegressionWithSGD Python API equal to Scala one

2015-10-05 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944305#comment-14944305
 ] 

Bryan Cutler commented on SPARK-10560:
--

Hi [~yanboliang]], I just want to make sure I'm on the same page as to what we 
need to do here.  
Here are the differences I see between the Python and Scala APIs for 
StreamingLogisticRegressionWithSGD:
-
* The documentation for Python is missing the default parameter values, also 
the same for StreamingLinearRegressionWithSGD

* In Python StreamingLogisticRegressionWithSGD the regularization defaults to 
0.01 while the Scala version defaults to 0.  
I believe other SGD implementations default to non-zero, so maybe there is some 
reason to turn it off in Streaming implementations?
In any case, these ones should probably default to the same value

* The Scala StreamingLogisticRegressionWithSGD is missing a method to set 
convergence tolerance, it is in the Python one

* StreamingLogisticRegressionWithSGD for Scala and Python are missing ability 
to set regularization parameter

* Python Streaming**RegressionWithSGD are missing API methods to set 
parameters, i.e. setStepSize

How about for this JIRA, I fix the documentation to include default parameters 
and then I will make JIRAs for the other items?
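
To make the convergence-tolerance and regularization gaps concrete, here is a 
hypothetical sketch of the builder-style setters the Scala class would need. The 
class name, defaults, and methods below are illustrative only; they are not the 
real StreamingLogisticRegressionWithSGD API in 1.5.

{code}
// Hypothetical, self-contained sketch; not Spark source code.
class StreamingLogisticRegressionWithSGDLike(
    private var stepSize: Double = 0.1,
    private var numIterations: Int = 50,
    private var regParam: Double = 0.0,
    private var convergenceTol: Double = 0.001) {

  def setStepSize(value: Double): this.type = { stepSize = value; this }
  def setNumIterations(value: Int): this.type = { numIterations = value; this }
  // The two setters under discussion: regularization and convergence tolerance.
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def setConvergenceTol(value: Double): this.type = { convergenceTol = value; this }
}
{code}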


> Make StreamingLogisticRegressionWithSGD Python API equal to Scala one
> 
>
> Key: SPARK-10560
> URL: https://issues.apache.org/jira/browse/SPARK-10560
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> The StreamingLogisticRegressionWithSGD Python API lacks some parameters 
> compared with the Scala one; here we make them equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10925) Exception when joining DataFrames

2015-10-05 Thread Jason C Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason C Lee updated SPARK-10925:

Comment: was deleted

(was: I removed your 2nd step "apply an UDF on column "name"" and was able to 
also recreate the problem. I reduced your test case to the following:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object TestCase2 {

  case class Individual(id: String, name: String, surname: String, birthDate: 
String)

  def main(args: Array[String]) {

val sc = new SparkContext("local", "join DFs")
val sqlContext = new SQLContext(sc)

val rdd = sc.parallelize(Seq(
  Individual("14", "patrick", "andrews", "10/10/1970")
))

val df = sqlContext.createDataFrame(rdd)
df.show()

val df1 = df;
val df2 = df1.withColumn("surname1", df("surname"))
df2.show()

val df3 = df2.withColumn("birthDate1", df("birthDate"))
df3.show()

val cardinalityDF1 = df3.groupBy("name")
  .agg(count("name").as("cardinality_name"))
cardinalityDF1.show()

val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name"))
df4.show()

val cardinalityDF2 = df4.groupBy("surname1")
  .agg(count("surname1").as("cardinality_surname"))
cardinalityDF2.show()

val df5 = df4.join(cardinalityDF2, df4("surname") === 
cardinalityDF2("surname1"))
df5.show()
  }
})

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.L

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944303#comment-14944303
 ] 

Jason C Lee commented on SPARK-10925:
-

I removed your 2nd step "apply an UDF on column "name"" and was able to also 
recreate the problem. I reduced your test case to the following:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object TestCase2 {

  case class Individual(id: String, name: String, surname: String, birthDate: 
String)

  def main(args: Array[String]) {

val sc = new SparkContext("local", "join DFs")
val sqlContext = new SQLContext(sc)

val rdd = sc.parallelize(Seq(
  Individual("14", "patrick", "andrews", "10/10/1970")
))

val df = sqlContext.createDataFrame(rdd)
df.show()

val df1 = df;
val df2 = df1.withColumn("surname1", df("surname"))
df2.show()

val df3 = df2.withColumn("birthDate1", df("birthDate"))
df3.show()

val cardinalityDF1 = df3.groupBy("name")
  .agg(count("name").as("cardinality_name"))
cardinalityDF1.show()

val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name"))
df4.show()

val cardinalityDF2 = df4.groupBy("surname1")
  .agg(count("surname1").as("cardinality_surname"))
cardinalityDF2.show()

val df5 = df4.join(cardinalityDF2, df4("surname") === 
cardinalityDF2("surname1"))
df5.show()
  }
}

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at sca

[jira] [Updated] (SPARK-10940) Too many open files Spark Shuffle

2015-10-05 Thread Sandeep Pal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-10940:

Description: 
Executing terasort by Spark-SQL on the data generated by teragen in hadoop. 
Data size generated is ~456 GB. 
Terasort passes with --total-executor-cores = 40, whereas it fails for 
--total-executor-cores = 120. 
I have tried to increase the ulimit to 10k but the problem persists.
Below is the error message from one of the executor node:

java.io.FileNotFoundException: 
/tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
 (Too many open files)

  was:
Executing terasort by Spark-SQL on the data generated by teragen in hadoop. 
Data size generated is ~456 GB. 
Terasort passing with --total-executor-cores = 40, where failing for 
--total-executor-cores = 120. 
I have tried to increase the ulimit to 10k but the problem persists.
Below is the error message from one of the executor node:

java.io.FileNotFoundException: 
/tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
 (Too many open files)


> Too many open files Spark Shuffle
> -
>
> Key: SPARK-10940
> URL: https://issues.apache.org/jira/browse/SPARK-10940
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node standalone spark cluster with 1 master and 5 
> worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 
> 36 cores.
>Reporter: Sandeep Pal
>
> Executing terasort by Spark-SQL on the data generated by teragen in hadoop. 
> Data size generated is ~456 GB. 
> Terasort passes with --total-executor-cores = 40, whereas it fails for 
> --total-executor-cores = 120. 
> I have tried to increase the ulimit to 10k but the problem persists.
> Below is the error message from one of the executor node:
> java.io.FileNotFoundException: 
> /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
>  (Too many open files)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10940) Too many open files Spark Shuffle

2015-10-05 Thread Sandeep Pal (JIRA)
Sandeep Pal created SPARK-10940:
---

 Summary: Too many open files Spark Shuffle
 Key: SPARK-10940
 URL: https://issues.apache.org/jira/browse/SPARK-10940
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, SQL
Affects Versions: 1.5.0
 Environment: 6 node standalone spark cluster with 1 master and 5 
worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 36 
cores.
Reporter: Sandeep Pal


Executing terasort by Spark-SQL on the data generated by teragen in hadoop. 
Data size generated is ~456 GB. 
Terasort passing with --total-executor-cores = 40, where failing for 
--total-executor-cores = 120. 
I have tried to increase the ulimit to 10k but the problem persists.
Below is the error message from one of the executor node:

java.io.FileNotFoundException: 
/tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
 (Too many open files)
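
One thing worth double-checking is whether the raised limit actually reached the 
executor JVMs: a ulimit set in an interactive shell does not always apply to 
daemons started elsewhere. A small sketch for verifying this, assuming a 
HotSpot/OpenJDK JVM on Linux:

{code}
// Paste into spark-shell (or run inside an executor) to see the limit the JVM
// actually got; uses the Sun-specific UnixOperatingSystemMXBean.
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

val os = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[UnixOperatingSystemMXBean]
println(s"open file descriptors: ${os.getOpenFileDescriptorCount}")
println(s"max file descriptors:  ${os.getMaxFileDescriptorCount}")
{code}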



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-10-05 Thread Dan Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944282#comment-14944282
 ] 

Dan Brown commented on SPARK-10685:
---

[~davies] [~joshrosen] Ok, I've split out the zip-after-repartition issue as 
https://issues.apache.org/jira/browse/SPARK-10939.

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repar

[jira] [Created] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-05 Thread Dan Brown (JIRA)
Dan Brown created SPARK-10939:
-

 Summary: Misaligned data with RDD.zip after repartition
 Key: SPARK-10939
 URL: https://issues.apache.org/jira/browse/SPARK-10939
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.1, 1.3.0
 Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
- Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
Reporter: Dan Brown


Split out from https://issues.apache.org/jira/browse/SPARK-10685:

Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces 
"misaligned" data, meaning different column values in the same row aren't 
matched, as if a zip shuffled the collections before zipping them. It's 
difficult to reproduce because it's nondeterministic, doesn't occur in local 
mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using 
pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 
(bin-without-hadoop).

Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
to build it ourselves when we ran into this problem. Let me put in my vote for 
reopening the issue and supporting {{DataFrame.zip}} in the standard lib.

- https://issues.apache.org/jira/browse/SPARK-7460

h3. Repro

Fail: RDD.zip after repartition
{code}
df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
df  = df.repartition(100)
rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
b=y.b))
[r for r in rdd.collect() if r.a != r.b][:3] # Should be []
{code}

Sample outputs (nondeterministic):
{code}
[]
[Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
[]
[]
[Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
[]
{code}

Test setup:

- local\[8]: {{MASTER=local\[8]}}
- dist\[N]: 1 driver + 1 master + N workers

{code}
"Fail" tests pass?  cluster mode  spark version

yes local[8]  1.3.0-cdh5.4.5
no  dist[4]   1.3.0-cdh5.4.5
yes local[8]  1.4.1
yes dist[1]   1.4.1
no  dist[2]   1.4.1
no  dist[4]   1.4.1
yes local[8]  1.5.0
yes dist[1]   1.5.0
no  dist[2]   1.5.0
no  dist[4]   1.5.0
{code}
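
As a possible workaround sketch (written in Scala, while the repro above is 
PySpark): avoid relying on zip's partition-and-order alignment after a 
repartition, and instead key both sides by a stable index and join on that key. 
Caching the indexed RDD keeps both branches reading the same materialized data. 
Names and sizes below are illustrative.

{code}
// Workaround sketch only; assumes a SQLContext named sqlContext is in scope.
val df = sqlContext.range(0, 10000).toDF("a").repartition(100)

// Assign a stable key to every row and pin the result in memory.
val indexed = df.rdd
  .zipWithIndex()
  .map { case (row, idx) => (idx, row.getLong(0)) }
  .cache()

// Derive the second column from the same keyed RDD and join on the key,
// instead of zipping two independently recomputed lineages.
val derived = indexed.mapValues(identity)
val aligned = indexed.join(derived).values

// Should print 0: every pair is aligned by construction.
println(aligned.filter { case (a, b) => a != b }.count())
{code}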



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944271#comment-14944271
 ] 

Marcelo Vanzin commented on SPARK-10937:


And I assume that you checked the classpath of the running spark shell and made 
sure there are no other Hive jars polluting it? Can you post the output of 
{{sys.props("java.class.path")}} from that shell? (The shell should work even 
if you get that error.)
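
For convenience, one way to dump that property one entry per line from the shell:

{code}
// Print each classpath entry on its own line to spot stray Hive jars.
sys.props("java.class.path")
  .split(java.io.File.pathSeparator)
  .foreach(println)
{code}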

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in the inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpa

[jira] [Issue Comment Deleted] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Curtis Wilde updated SPARK-10937:
-
Comment: was deleted

(was: In spark-defaults.conf I've added:

spark.sql.hive.metastore.version0.12.0
spark.sql.hive.metastore.jars   
/usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar

Still getting the same error.)

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in the inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initial

[jira] [Reopened] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Curtis Wilde reopened SPARK-10937:
--

In spark-defaults.conf I've added:
spark.sql.hive.metastore.version 0.12.0
spark.sql.hive.metastore.jars 
/usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar
Still getting the same error.

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in the inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.

[jira] [Assigned] (SPARK-10337) Views are broken

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10337:


Assignee: Apache Spark  (was: Wenchen Fan)

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collecti

[jira] [Commented] (SPARK-10337) Views are broken

2015-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944259#comment-14944259
 ] 

Apache Spark commented on SPARK-10337:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8990
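
For reference, a minimal spark-shell sketch of the reported scenario (the table name 100milints and its single id column are taken verbatim from the report; the HiveContext setup is an assumption for illustration, not part of the fix):

{code}
// Spark 1.5.0, Hive-enabled build assumed; sqlContext is the HiveContext the shell creates.
// 100milints is an existing table with an `id` column, as in the report above.
sqlContext.sql("SELECT * FROM 100milints").show()                   // works
sqlContext.sql("CREATE VIEW testView AS SELECT * FROM 100milints")  // appears to work
sqlContext.sql("SELECT * FROM testView").show()                     // fails: cannot resolve '100milints.col'
{code}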

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

[jira] [Assigned] (SPARK-10337) Views are broken

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10337:


Assignee: Wenchen Fan  (was: Apache Spark)

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collectio

[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944258#comment-14944258
 ] 

Curtis Wilde commented on SPARK-10937:
--

In spark-defaults.conf I've added:

spark.sql.hive.metastore.version 0.12.0
spark.sql.hive.metastore.jars   
/usr/lib/spark/lib/guava-11.0.2.jar:/usr/lib/spark/lib/hadoop-client-2.2.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-common-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-exec-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-metastore-0.12.0.2.0.10.0-1.jar:/usr/lib/spark/lib/hive-serde-0.12.0.2.0.10.0-1.jar

Still getting the same error.
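
As a quick way to see which HiveConf the driver is actually loading, a hypothetical spark-shell debugging snippet (not something Spark itself provides):

{code}
// Which jar does HiveConf$ConfVars come from, and does it have the method
// HiveContext needs? getDefaultExpr() only exists from Hive 0.14.0 on, so
// "false" here means pre-0.14 Hive classes are shadowing Spark's built-in ones.
val confVars = Class.forName("org.apache.hadoop.hive.conf.HiveConf$ConfVars")
println(confVars.getProtectionDomain.getCodeSource.getLocation)
println(confVars.getMethods.exists(_.getName == "getDefaultExpr"))
{code}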

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkI

[jira] [Assigned] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-:
---

Assignee: Michael Armbrust

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-:

Description: 
The RDD API is very flexible, and as a result harder to optimize its execution 
in some cases. The DataFrame API, on the other hand, is much easier to 
optimize, but lacks some of the nice perks of the RDD API (e.g. harder to use 
UDFs, lack of strong types in Scala/Java).

The goal of Spark Datasets is to provide an API that allows users to easily 
express transformations on domain objects, while also providing the performance 
and robustness advantages of the Spark SQL execution engine.

h2. Requirements
 - *Fast* - In most cases, the performance of Datasets should be equal to or 
better than working with RDDs.  Encoders should be as fast or faster than Kryo 
and Java serialization, and unnecessary conversion should be avoided.
 - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
objects should provide compile-time safety where possible.  When converting 
from data where the schema is not known at compile-time (for example data read 
from an external source such as JSON), the conversion function should fail-fast 
if there is a schema mismatch.
 - *Support for a variety of object models* - Default encoders should be 
provided for a variety of object models: primitive types, case classes, tuples, 
POJOs, JavaBeans, etc.  Ideally, objects that follow standard conventions, such 
as Avro SpecificRecords, should also work out of the box.
 - *Java Compatible* - Datasets should provide a single API that works in both 
Scala and Java.  Where possible, shared types like Array will be used in the 
API.  Where not possible, overloaded functions should be provided for both 
languages.  Scala concepts, such as ClassTags should not be required in the 
user-facing API.
 - *Interoperates with DataFrames* - Users should be able to seamlessly 
transition between Datasets and DataFrames, without specifying conversion 
boiler-plate.  When names used in the input schema line-up with fields in the 
given class, no extra mapping should be necessary.  Libraries like MLlib should 
not need to provide different interfaces for accepting DataFrames and Datasets 
as input.

For a detailed outline of the complete proposed API: 
[marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
For an initial discussion of the design considerations in this API: [design 
doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
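
A minimal sketch of what working with such an API could look like from Scala (the Dataset type, toDS, and the implicit encoders here follow the linked proposal and are assumptions, not a shipped API):

{code}
case class Person(name: String, age: Int)

// Assuming sqlContext.implicits._ provides encoders and a toDS conversion,
// as sketched in the proposal:
import sqlContext.implicits._

val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Typed, compile-checked transformations on domain objects...
val adults = people.filter(_.age >= 30).map(_.name)

// ...with seamless movement between Datasets and DataFrames.
val df = adults.toDF()
{code}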

  was:
The RDD API is very flexible, and as a result harder to optimize its execution 
in some cases. The DataFrame API, on the other hand, is much easier to 
optimize, but lacks some of the nice perks of the RDD API (e.g. harder to use 
UDFs, lack of strong types in Scala/Java).

As a Spark user, I want an API that sits somewhere in the middle of the 
spectrum so I can write most of my applications with that API, and yet it can 
be optimized well by Spark to achieve performance and stability.



> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Arr

[jira] [Commented] (SPARK-10938) Remove typeId in columnar cache

2015-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944238#comment-14944238
 ] 

Apache Spark commented on SPARK-10938:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8989

> Remove typeId in columnar cache
> ---
>
> Key: SPARK-10938
> URL: https://issues.apache.org/jira/browse/SPARK-10938
> Project: Spark
>  Issue Type: Task
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> typeId is not needed in the columnar cache, and keeping it there is just confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10938) Remove typeId in columnar cache

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10938:


Assignee: Davies Liu  (was: Apache Spark)

> Remove typeId in columnar cache
> ---
>
> Key: SPARK-10938
> URL: https://issues.apache.org/jira/browse/SPARK-10938
> Project: Spark
>  Issue Type: Task
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> typeId is not needed in the columnar cache, and keeping it there is just confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10938) Remove typeId in columnar cache

2015-10-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10938:


Assignee: Apache Spark  (was: Davies Liu)

> Remove typeId in columnar cache
> ---
>
> Key: SPARK-10938
> URL: https://issues.apache.org/jira/browse/SPARK-10938
> Project: Spark
>  Issue Type: Task
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> typeId is not needed in the columnar cache, and keeping it there is just confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10938) Remove typeId in columnar cache

2015-10-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10938:
--

 Summary: Remove typeId in columnar cache
 Key: SPARK-10938
 URL: https://issues.apache.org/jira/browse/SPARK-10938
 Project: Spark
  Issue Type: Task
Reporter: Davies Liu
Assignee: Davies Liu


typeId is not needed in the columnar cache, and keeping it there is just confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944218#comment-14944218
 ] 

Curtis Wilde edited comment on SPARK-10937 at 10/5/15 11:13 PM:


Yes, I added the following jars to the classpath:

guava-11.0.2.jar
hadoop-client-2.2.0.2.0.10.0-1.jar
hive-common-0.12.0.2.0.10.0-1.jar
hive-exec-0.12.0.2.0.10.0-1.jar
hive-metastore-0.12.0.2.0.10.0-1.jar
hive-serde-0.12.0.2.0.10.0-1.jar

(I assumed these were the jars that should be added, because setting
spark.sql.hive.metastore.jars maven
caused Spark to look for these jars.)


was (Author: crutis):
Yes, I added the following jars to the classpath:

guava-11.0.2.jar
hadoop-client-2.2.0.2.0.10.0-1.jar
hive-common-0.12.0.2.0.10.0-1.jar
hive-exec-0.12.0.2.0.10.0-1.jar
hive-metastore-0.12.0.2.0.10.0-1.jar
hive-serde-0.12.0.2.0.10.0-1.jar

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apac

[jira] [Resolved] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10937.

Resolution: Invalid

Then that's your problem. You're causing the error by overriding the Hive 
classes shipped with Spark.

If you want Spark to use your version-specific Hive jars to access the 
metastore, take a look at {{spark.sql.hive.metastore.jars}}.
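
For completeness, a sketch of supplying those two settings when building the context yourself instead of via spark-defaults.conf (assuming, as the stack trace above suggests, that spark.sql.* entries on the SparkConf are picked up by the SQLContext; the app name is arbitrary):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("hive-0.12-metastore")
  // Version of the Hive metastore to talk to.
  .set("spark.sql.hive.metastore.version", "0.12.0")
  // Where to get the matching client jars: "maven" downloads them,
  // or use a colon-separated list of local jars as in the comments above.
  .set("spark.sql.hive.metastore.jars", "maven")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
{code}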

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> org.apach

[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944218#comment-14944218
 ] 

Curtis Wilde commented on SPARK-10937:
--

Yes, I added the following jars to the classpath:

guava-11.0.2.jar
hadoop-client-2.2.0.2.0.10.0-1.jar
hive-common-0.12.0.2.0.10.0-1.jar
hive-exec-0.12.0.2.0.10.0-1.jar
hive-metastore-0.12.0.2.0.10.0-1.jar
hive-serde-0.12.0.2.0.10.0-1.jar

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:1

[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944215#comment-14944215
 ] 

Marcelo Vanzin commented on SPARK-10937:


Nevermind. The exception is coming from the execution code, which isn't 
affected by that config in any case. Are you (or the configuration you're 
using) overriding Spark's classpath in any way? e.g. placing HDP's Hive jars in 
the driver's classpath?

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(Spar

[jira] [Comment Edited] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944213#comment-14944213
 ] 

Curtis Wilde edited comment on SPARK-10937 at 10/5/15 11:06 PM:


Yes
spark.sql.hive.metastore.version 0.12.0


was (Author: crutis):
Yes

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>  

[jira] [Created] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)
Curtis Wilde created SPARK-10937:


 Summary: java.lang.NoSuchMethodError when instantiating sqlContext 
in spark-shell using hive 0.12.x, 0.13.x
 Key: SPARK-10937
 URL: https://issues.apache.org/jira/browse/SPARK-10937
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 1.5.1
Reporter: Curtis Wilde


Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception, and 
sqlContext is not properly created.

Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.

org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
exception below:

java.lang.NoSuchMethodError: 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
at 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
at 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:234)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
at 
org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$r
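
For readers hitting this, a minimal Scala sketch of the reflection-guard idea 
(illustrative only, not the Spark patch; it assumes the pre-0.14 ConfVars still 
exposes its default through a public defaultVal field, which should be checked 
against the actual Hive jar in use):

{code}
// Illustrative only: probe for the Hive 0.14+ getDefaultExpr() before calling it,
// and fall back to the (assumed) public defaultVal field on older ConfVars.
import org.apache.hadoop.hive.conf.HiveConf.ConfVars

def defaultExprCompat(v: ConfVars): String =
  try {
    // Present only in Hive >= 0.14.0
    classOf[ConfVars].getMethod("getDefaultExpr").invoke(v).asInstanceOf[String]
  } catch {
    case _: NoSuchMethodException =>
      // Assumption: Hive 0.12.x/0.13.x ConfVars keep the default in a public field
      classOf[ConfVars].getField("defaultVal").get(v).asInstanceOf[String]
  }
{code}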

[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944213#comment-14944213
 ] 

Curtis Wilde commented on SPARK-10937:
--

Yes

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.<init>(<console>:9)
> at $iwC.<init>(<console>:18)
> at <init>(<console>:20)
> at .<init>(<console>:24)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkI

[jira] [Commented] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944211#comment-14944211
 ] 

Marcelo Vanzin commented on SPARK-10937:


Did you set {{spark.sql.hive.metastore.version}} to {{0.12}}?
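
For anyone following along, a hedged sketch of how that option is usually supplied 
when building the context programmatically (the property names are the standard 
Spark 1.5 ones, the app name is made up; spark.sql.* values set on the SparkConf 
are picked up when the SQLContext/HiveContext is constructed):

{code}
// Sketch: set the metastore version/jars before any HiveContext exists.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("hive-0.12-metastore")                 // made-up app name
  .set("spark.sql.hive.metastore.version", "0.12.0")
  .set("spark.sql.hive.metastore.jars", "maven")     // or a classpath containing Hive 0.12 jars
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
{code}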

> java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell 
> using hive 0.12.x, 0.13.x
> --
>
> Key: SPARK-10937
> URL: https://issues.apache.org/jira/browse/SPARK-10937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.5.1
>Reporter: Curtis Wilde
>
> Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
> sqlContext is not properly created.
> Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
> org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.
> org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
> 'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
> exception below:
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
> at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:234)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.<init>(<console>:9)
> at $iwC.<init>(<console>:18)
> at <init>(<console>:20)
> at .<init>(<console>:24)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$Spa

[jira] [Updated] (SPARK-10937) java.lang.NoSuchMethodError when instantiating sqlContext in spark-shell using hive 0.12.x, 0.13.x

2015-10-05 Thread Curtis Wilde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Curtis Wilde updated SPARK-10937:
-
Description: 
Running spark-shell with Hive 0.12.0 (HDP 2.0) causes an exception and 
sqlContext is not properly created.

Method 'public String getDefaultExpr()' is not in inner class ConfVars of 
org.apache.hadoop.hive.conf.HiveConf.java before Hive 0.14.0.

org.apache.spark.sql.hive.HiveContext.scala calls 'getDefaultExpr()' when 
'newTemporaryConfiguration(): Map[String, String]' is invoked, causing the 
exception below:

java.lang.NoSuchMethodError: 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.getDefaultExpr()Ljava/lang/String;
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:671)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$newTemporaryConfiguration$1.apply(HiveContext.scala:669)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:669)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:164)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:391)
at 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:235)
at 
org.apache.spark.sql.SQLContext$$anonfun$5.apply(SQLContext.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:234)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
at 
org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apac

[jira] [Commented] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec

2015-10-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944197#comment-14944197
 ] 

Apache Spark commented on SPARK-8848:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8988

> Write Parquet LISTs and MAPs conforming to Parquet format spec
> --
>
> Key: SPARK-8848
> URL: https://issues.apache.org/jira/browse/SPARK-8848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> [Parquet format PR #17|https://github.com/apache/parquet-format/pull/17] 
> standardized structures of Parquet complex types (LIST & MAP). Spark SQL 
> should follow this spec and write Parquet data conforming to the standard.
> Note that although the Parquet files currently written by Spark SQL are 
> non-standard (the Parquet format spec wasn't clear about this part when 
> Spark SQL's Parquet support was authored), they are still compatible with the 
> most recent Parquet format spec, because the format we use is covered by the 
> backwards-compatibility rules.
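
To make the scope concrete, a small sketch (column names are made up) of the kind of 
DataFrame this affects: array and map columns, whose repeated-group layout in the 
written Parquet files is what the format spec standardizes.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-nested").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// LIST (array) and MAP typed columns become Parquet repeated groups on disk.
val df = Seq(
  (1, Seq("a", "b"), Map("k" -> 1)),
  (2, Seq("c"), Map("k" -> 2))
).toDF("id", "tags", "props")

df.write.parquet("/tmp/nested_parquet")
{code}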



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-10-05 Thread Hans van den Bogert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166
 ] 

Hans van den Bogert edited comment on SPARK-10474 at 10/5/15 10:39 PM:
---

I applied this patch against tag 1.5.1:
https://gist.github.com/f110f64887f4739b7dd8

Output of the added println()s is near the beginning of:

{noformat}
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
1048576
15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0
I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at 
master@10.149.3.5:5050
I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting 
to register without authentication
I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 
20151006-001500-84120842-5050-6847-
15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context available as sc.
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
SQL context available as sqlContext.
{noformat}


was (Author: hbogert):
Had this patch against tag 1.5.1:
https://gist.github.com/f110f64887f4739b7dd8.git

Output of the added println()'s are near the beginning in:

{noformat}
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
1048576
15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0
I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at 
master@10.149.3.5:5050
I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting 
to register without authentication
I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 
20151006-001500-84120842-5050-6847-
15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context available as sc.
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using bu

[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Weide,

These are major features and some of them are under development. You can check 
their status in the linked issues. Could you work on something smaller as a 
first step? [~mengxr], do you have any suggestions?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-10-05 Thread Hans van den Bogert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166
 ] 

Hans van den Bogert edited comment on SPARK-10474 at 10/5/15 10:36 PM:
---

Had this patch against tag 1.5.1:
https://gist.github.com/f110f64887f4739b7dd8.git

Output of the added println()'s are near the beginning in:

{noformat}
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
1048576
15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0
I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at 
master@10.149.3.5:5050
I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting 
to register without authentication
I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 
20151006-001500-84120842-5050-6847-
15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context available as sc.
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
SQL context available as sqlContext.
{noformat}


was (Author: hbogert):
Had this patch against tag 1.5.1:
https://gist.github.com/f110f64887f4739b7dd8.git

Output of the added println()'s are near the beginning in:

Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
1048576
15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0
I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at 
master@10.149.3.5:5050
I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting 
to register without authentication
I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 
20151006-001500-84120842-5050-6847-
15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context available as sc.
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using built

[jira] [Commented] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944168#comment-14944168
 ] 

Xiangrui Meng commented on SPARK-10384:
---

I marked the small umbrella JIRAs as duplicates in favor of concrete ones. I added a 
list of statistics to this JIRA's description. Hopefully this helps track the 
progress better.

> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
> * -number of categories- (This is COUNT DISTINCT in SQL.)
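
Not part of the JIRA itself, but to make the idea concrete: a toy "range" (max - min) 
statistic written against the public Spark 1.5 UDAF API, which is the shape each of 
the statistics listed above would take.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class RangeUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Double.PositiveInfinity
    buffer(1) = Double.NegativeInfinity
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val x = input.getDouble(0)
      buffer(0) = math.min(buffer.getDouble(0), x)
      buffer(1) = math.max(buffer.getDouble(1), x)
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = math.min(buffer1.getDouble(0), buffer2.getDouble(0))
    buffer1(1) = math.max(buffer1.getDouble(1), buffer2.getDouble(1))
  }
  def evaluate(buffer: Row): Double = buffer.getDouble(1) - buffer.getDouble(0)
}

// Usage, mirroring the snippet in the description:
// df.groupBy("key").agg(new RangeUDAF()(df("x")))
{code}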



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-10-05 Thread Hans van den Bogert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944166#comment-14944166
 ] 

Hans van den Bogert commented on SPARK-10474:
-

Had this patch against tag 1.5.1:
https://gist.github.com/f110f64887f4739b7dd8.git

Output of the added println()'s are near the beginning in:

Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 00:16:00 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
1048576
15/10/06 00:16:02 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
I1006 00:16:02.414851 25123 sched.cpp:137] Version: 0.21.0
I1006 00:16:02.423246 25115 sched.cpp:234] New master detected at 
master@10.149.3.5:5050
I1006 00:16:02.423482 25115 sched.cpp:242] No credentials provided. Attempting 
to register without authentication
I1006 00:16:02.427250 25121 sched.cpp:408] Framework registered with 
20151006-001500-84120842-5050-6847-
15/10/06 00:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context available as sc.
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:04 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/06 00:16:09 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:11 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/06 00:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
SQL context available as sqlContext.

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at

[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem

2015-10-05 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-10912:
-
Attachment: s3a_metrics.patch

Adding s3a is fairly straightforward. I guess the reason it's not included is 
that s3a support (via hadoop-aws.jar) is not part of the default Hadoop 
distribution due to licensing issues. I created a patch to enable s3a metrics, 
both on the executors and on the driver. Reporting shuffle statistics requires 
more thought, although all the numbers are already collected in 
TaskMetrics.scala (input, output, shuffle, local, remote, spill, records, 
bytes, etc.). I think it would make sense to report the aggregated metrics per 
executor across all tasks, so it's easy to have an overall sense of disk I/O 
and network traffic.

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: 
> "hdfs" and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files; it would 
> be nice to also report shuffle read/write metrics, so it can help with 
> optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO especially for s3 setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10641) skewness and kurtosis support

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155
 ] 

Xiangrui Meng edited comment on SPARK-10641 at 10/5/15 10:34 PM:
-

Any updates? Please submit a PR for code review, and try to think about how to 
reuse existing implementation of variance.


was (Author: mengxr):
Any updates?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on the following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
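
For reference, a self-contained sketch (not the SPARK-10641 implementation) of the 
one-pass moment updates from the cited page; variance, skewness and kurtosis all 
fall out of the same running state (n, mean, M2, M3, M4), which is the reuse 
opportunity mentioned above.

{code}
// Online higher-order moments; all new values are computed from the previous state.
final case class Moments(n: Long = 0L, mean: Double = 0.0,
                         m2: Double = 0.0, m3: Double = 0.0, m4: Double = 0.0) {
  def add(x: Double): Moments = {
    val nNew = n + 1
    val delta = x - mean
    val deltaN = delta / nNew
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n
    Moments(
      nNew,
      mean + deltaN,
      m2 + term1,
      m3 + term1 * deltaN * (nNew - 2) - 3 * deltaN * m2,
      m4 + term1 * deltaN2 * (nNew * nNew - 3 * nNew + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    )
  }
  def variance: Double = m2 / (n - 1)                      // sample variance reuses M2 (needs n >= 2)
  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0          // excess kurtosis
}

// val m = data.foldLeft(Moments())(_ add _)
{code}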



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-10602.
-
Resolution: Duplicate

Marking this as duplicated in favor of concrete JIRAs. Please continue the 
discussion in those JIRAs.

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155
 ] 

Xiangrui Meng commented on SPARK-10641:
---

Any updates?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on the following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944153#comment-14944153
 ] 

Xiangrui Meng edited comment on SPARK-10603 at 10/5/15 10:32 PM:
-

Marking this as duplicated in favor of concrete JIRAs. See SPARK-10384 for the 
list.


was (Author: mengxr):
Marking this as duplicated in favor of concrete JIRAs.

> Univariate statistics as UDAFs: multi-pass continuous stats
> ---
>
> Key: SPARK-10603
> URL: https://issues.apache.org/jira/browse/SPARK-10603
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details. This subtask covers statistics for 
> continuous values requiring multiple passes over the data, such as median and 
> quantiles.
> This JIRA is an umbrella. For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-10603.
-
Resolution: Duplicate

Marking this as duplicated in favor of concrete JIRAs.

> Univariate statistics as UDAFs: multi-pass continuous stats
> ---
>
> Key: SPARK-10603
> URL: https://issues.apache.org/jira/browse/SPARK-10603
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details. This subtask covers statistics for 
> continuous values requiring multiple passes over the data, such as median and 
> quantiles.
> This JIRA is an umbrella. For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10604) Univariate statistics as UDAFs: categorical stats

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10604.
---
Resolution: Duplicate

Marking this as duplicated in favor of concrete JIRAs.

> Univariate statistics as UDAFs: categorical stats
> -
>
> Key: SPARK-10604
> URL: https://issues.apache.org/jira/browse/SPARK-10604
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details. This subtask covers statistics for 
> categorical values, such as number of categories or mode.
> This JIRA is an umbrella. For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-05 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944150#comment-14944150
 ] 

Matt Cheah commented on SPARK-10877:


Does spark-submit enable assertions? I'm not sure how SBT passes these kinds of 
assertion options along.

Also, what JDK / Java Version are you using, and what OS?
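
On the SBT side (a hedged note, not specific to this ticket): assertions only reach 
the test JVM when it is forked and -ea is forwarded via javaOptions, e.g. in 
build.sbt (sbt 0.13 syntax) as below; with spark-submit the equivalent would be 
passing -ea through spark.executor.extraJavaOptions and --driver-java-options.

{code}
// build.sbt fragment: without fork, sbt's own JVM flags apply and -ea is not forwarded.
fork in Test := true
javaOptions in Test += "-ea"
{code}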

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?
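
For question 2, the word-alignment arithmetic that the assertion checks is just 
rounding the byte length up to the next multiple of 8; a tiny sketch (not a proposed 
fix) of what padding 12 bytes to 16 looks like:

{code}
// Round a byte length up to the next multiple of 8 (one machine word).
def roundUpToWord(lengthInBytes: Int): Int = (lengthInBytes + 7) & ~7

assert(roundUpToWord(12) == 16)   // the 12-byte case from this ticket
assert(roundUpToWord(16) == 16)   // already aligned lengths are unchanged
assert(roundUpToWord(12) % 8 == 0)
{code}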



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
* -number of categories- (This is COUNT DISTINCT in SQL.)

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
> * -number of categories- (This is COUNT DISTINCT in SQL.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10936) UDAF "mode" for categorical variables

2015-10-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10936:
-

 Summary: UDAF "mode" for categorical variables
 Key: SPARK-10936
 URL: https://issues.apache.org/jira/browse/SPARK-10936
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng


This is similar to frequent items except that we don't have a threshold on the 
frequency. So an exact implementation might require a global shuffle.
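
A hedged illustration of why the exact version shuffles (the data and column names 
are made up): computing the mode as a count of every distinct value followed by a 
global ordering of those counts.

{code}
import org.apache.spark.sql.functions._

// Toy data; assumes a sqlContext as in spark-shell.
val df = sqlContext.createDataFrame(Seq(
  ("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)
)).toDF("category", "value")

val modeRow = df.groupBy("category")   // shuffle 1: count every distinct value
  .count()
  .orderBy(desc("count"))              // shuffle 2: rank the counts globally
  .limit(1)
  .collect()
  .headOption                          // Some([a,3]) for this toy data
{code}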



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-10384:
-

Assignee: Xiangrui Meng  (was: Burak Yavuz)

> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10862.
---
Resolution: Duplicate

> Univariate Statistics: Adding median & quantile support as UDAF
> ---
>
> Key: SPARK-10862
> URL: https://issues.apache.org/jira/browse/SPARK-10862
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median (SPARK-6761)
* approximate quantiles (SPARK-6761)

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761)
> * approximate quantiles (SPARK-6761)
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance (SPARK-9296)
* population variance (SPARK-9296)
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944141#comment-14944141
 ] 

Jason C Lee commented on SPARK-10877:
-

I enabled assertions by adding the following to my build.sbt:
javaOptions += "-ea"
and used sbt package to build.

I also ran it with spark-submit instead of spark-shell and still don't see what 
you see:
$SPARK_HOME/bin/spark-submit --class "SparkFilterByKeyTest" --master local[2] 
target/scala-2.10/simple-project_2.10-1.0.jar 
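
For completeness, a build.sbt sketch of the assertion setup described above (my 
assumption: javaOptions only takes effect when sbt forks the JVM, and a jar run 
through spark-submit needs the flag passed to the driver/executor JVMs instead):

{code}
// build.sbt sketch: enable JVM assertions for `sbt run` / `sbt test`
fork := true            // javaOptions is only honored in a forked JVM
javaOptions += "-ea"

// Assumption: for spark-submit, -ea would instead be passed via
// --driver-java-options "-ea" and spark.executor.extraJavaOptions.
{code}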



> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?
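
To make the alignment question concrete, here is a small sketch of the usual 
word-alignment arithmetic (illustrative only; this is not the Spark unsafe hash 
code itself):

{code}
// 8-byte (word) alignment check, matching the assertion in the stack trace above.
def isWordAligned(lengthInBytes: Int): Boolean = (lengthInBytes & 0x7) == 0

// Rounding a byte length up to the next multiple of 8 -- one possible fix,
// assuming the trailing padding bytes are zeroed before hashing.
def roundUpToWord(lengthInBytes: Int): Int = (lengthInBytes + 7) & ~7

assert(!isWordAligned(12))       // the failing length reported above
assert(roundUpToWord(12) == 16)  // padded to the next word boundary
{code}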



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance
> * population variance
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range
> * sample variance
> * population variance
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range
> * -mean-
> * sample variance
> * population variance
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation-
* -population standard deviation-
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range
> * sample variance
> * population variance
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* -max-
* range
* sample variance
* population variance
* -sample standard deviation-
* -population standard deviation-
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range
> * sample variance
> * population variance
> * -sample standard deviation-
> * -population standard deviation-
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* -min-
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* ~~min~~
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * max
> * range
> * sample variance
> * population variance
> * sample standard deviation
> * population standard deviation
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* ~~min~~
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* min
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * ~~min~~
> * max
> * range
> * sample variance
> * population variance
> * sample standard deviation
> * population standard deviation
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10868) monotonicallyIncreasingId() supports offset for indexing

2015-10-05 Thread Martin Senne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944137#comment-14944137
 ] 

Martin Senne commented on SPARK-10868:
--

I will do that (and give it my best!). Thanks for offering this opportunity.

> monotonicallyIncreasingId() supports offset for indexing
> 
>
> Key: SPARK-10868
> URL: https://issues.apache.org/jira/browse/SPARK-10868
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Martin Senne
>
> With SPARK-7135 and https://github.com/apache/spark/pull/5709 
> `monotonicallyIncreasingID()` allows creating an index column with unique 
> ids. The indexing always starts at 0 (no offset).
> *Feature wish*
> Having a parameter `offset`, such that the function can be used as
> {{monotonicallyIncreasingID( offset )}}
> and indexing _starts at *offset* instead of 0_.
> *Use-case* 
> Add rows to a DataFrame that is already written to a DB (via 
> _.write.jdbc(...)_).
> In detail:
> - A DataFrame *A* with an ID column holding indices from 0 to 199 already 
> exists in the DB.
> - New rows need to be added to *A*. This involves:
> -- Creating a DataFrame *A'* with the new rows, but without an id column
> -- Adding the index column to *A'* - this time starting at *200*, since ids 
> 0 to 199 are already taken (*here, monotonicallyIncreasingID( 200 ) is 
> required*)
> -- Unioning *A* and *A'*
> -- Storing the result in the DB (a workaround sketch with the current API 
> follows below)
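
Until such an offset parameter exists, a workaround sketch (assuming the new 
ids only need to be unique and at least 200, not consecutive) is to shift the 
generated ids by a constant:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, monotonicallyIncreasingId}

// Sketch: append an id column whose values are unique and >= offset.
// `newRows` stands for the DataFrame A' before the id column is added.
def withOffsetIds(newRows: DataFrame, offset: Long): DataFrame =
  newRows.withColumn("id", monotonicallyIncreasingId() + lit(offset))

// e.g. withOffsetIds(newRows, 200L).unionAll(a).write.jdbc(url, table, props)
{code}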



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10384) Univariate statistics as UDAFs

2015-10-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
--
Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.

Univariate statistics for continuous variables:
* min
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
general implementation and tracks the process of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with 
DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL can optimize the sequence to 
avoid duplicate computation.


> Univariate statistics as UDAFs
> --
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses general implementation and tracks the process of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL can optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * min
> * max
> * range
> * sample variance
> * population variance
> * sample standard deviation
> * population standard deviation
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944128#comment-14944128
 ] 

Xiangrui Meng commented on SPARK-9941:
--

Created https://issues.apache.org/jira/browse/SPARK-10935. I think we could 
start by importing the datasets using spark-csv.
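
For example, a minimal spark-csv load could look like this (sketch: assumes the 
com.databricks:spark-csv package is on the classpath, a SQLContext named 
sqlContext as in spark-shell, and a hypothetical file train.csv):

{code}
// Load a Kaggle-style CSV with a header row, inferring column types.
val train = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("train.csv")
{code}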

> Try ML pipeline API on Kaggle competitions
> --
>
> Key: SPARK-9941
> URL: https://issues.apache.org/jira/browse/SPARK-9941
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is an umbrella JIRA to track some fun tasks :)
> We have built many features under the ML pipeline API, and we want to see how 
> it works on real-world datasets, e.g., Kaggle competition datasets 
> (https://www.kaggle.com/competitions). We want to invite community members to 
> help test. The goal is NOT to win the competitions but to provide code 
> examples and to find out missing features and other issues to help shape the 
> roadmap.
> For people who are interested, please do the following:
> 1. Create a subtask (or leave a comment if you cannot create a subtask) to 
> claim a Kaggle dataset.
> 2. Use the ML pipeline API to build and tune an ML pipeline that works for 
> the Kaggle dataset.
> 3. Paste the code to gist (https://gist.github.com/) and provide the link 
> here.
> 4. Report missing features, issues, running times, and accuracy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


