[jira] [Closed] (SPARK-9826) Cannot use custom classes in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay closed SPARK-9826. --- Verified working with a local build of branch-1.4 (1.4.2-snapshot) Cannot use custom classes in log4j.properties - Key: SPARK-9826 URL: https://issues.apache.org/jira/browse/SPARK-9826 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Michel Lemay Assignee: Michel Lemay Priority: Minor Labels: regression Fix For: 1.4.2, 1.5.0 log4j is initialized before spark class loader is set on the thread context. Therefore it cannot use classes embedded in fat-jars submitted to spark. While parsing arguments, spark calls methods on Utils class and triggers ShutdownHookManager static initialization. This then leads to log4j being initialized before spark gets the chance to specify custom class MutableURLClassLoader on the thread context. See detailed explanation here: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
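For context, a minimal sketch of the failure mode (the package and class names are hypothetical, not from the ticket): a custom appender that lives only inside the submitted fat jar. Because log4j initializes before Spark installs MutableURLClassLoader as the thread context class loader, log4j cannot resolve such a class and fails with ClassNotFoundException.
{code}
// Hypothetical custom appender shipped only inside the application fat jar.
// log4j.properties would reference it as:
//   log4j.appender.json=com.example.logging.JsonAppender
// If log4j initializes before Spark sets MutableURLClassLoader as the thread
// context class loader, resolving this class fails with ClassNotFoundException.
package com.example.logging

import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent

class JsonAppender extends AppenderSkeleton {
  override def append(event: LoggingEvent): Unit =
    System.err.println(s"""{"level":"${event.getLevel}","msg":"${event.getRenderedMessage}"}""")
  override def close(): Unit = ()
  override def requiresLayout(): Boolean = false
}
{code}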
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696889#comment-14696889 ] Frank Rosner commented on SPARK-9971: - Ok, so shall we close it as a wontfix? MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor
h4. Problem Description
When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}.
h4. How to Reproduce
{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}
h4. Solution
The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
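A quick illustration of the comparison semantics behind the report, plus the ignore-NaN behaviour the ticket proposes, sketched on plain Scala collections rather than as a patch to the SQL MaxFunction:
{code}
// IEEE 754: every ordered comparison against NaN is false, so a running maximum
// that ever becomes NaN can never be replaced by a real number again.
Double.NaN > 1.0   // false
1.0 > Double.NaN   // false

// Proposed semantics on plain doubles: ignore NaN; if only NaN values remain,
// the maximum is undefined.
def maxIgnoringNaN(xs: Seq[Double]): Option[Double] = {
  val clean = xs.filterNot(_.isNaN)
  if (clean.isEmpty) None else Some(clean.max)
}

maxIgnoringNaN(Seq(Double.NaN, -10d, 0d))  // Some(0.0)
maxIgnoringNaN(Seq(Double.NaN))            // None
{code}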
[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696891#comment-14696891 ] Apache Spark commented on SPARK-9906: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/8197 User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Manoj Kumar (was: Apache Spark) User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Apache Spark (was: Manoj Kumar) User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Apache Spark SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696894#comment-14696894 ] Sean Owen commented on SPARK-9971: -- I'd always prefer to leave it open at least ~1 week to see if anyone else has comments. If not, yes I personally would argue that the current behavior is the right one. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor
h4. Problem Description
When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}.
h4. How to Reproduce
{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}
h4. Solution
The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696919#comment-14696919 ] Apache Spark commented on SPARK-9974: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8198 SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
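A rough way to check whether a given assembly is affected, loosely mirroring the reflective lookup Hive performs in Table.getOutputFormatClass (paste into a spark-shell started against the assembly under test):
{code}
// Probe for the old-namespace bundle class that Hive resolves reflectively.
// On an SBT-built assembly exhibiting this bug, the catch branch is taken.
try {
  Class.forName("parquet.hadoop.api.WriteSupport")
  println("com.twitter:parquet-hadoop-bundle classes are present")
} catch {
  case _: ClassNotFoundException =>
    println("parquet.hadoop.api.WriteSupport is missing from the assembly")
}
{code}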
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: (was: Apache Spark) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298) at
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: Apache Spark SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at
[jira] [Commented] (SPARK-9960) run-example SparkPi fails on Mac
[ https://issues.apache.org/jira/browse/SPARK-9960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696546#comment-14696546 ] Apache Spark commented on SPARK-9960: - User 'farseer90718' has created a pull request for this issue: https://github.com/apache/spark/pull/8188 run-example SparkPi fails on Mac Key: SPARK-9960 URL: https://issues.apache.org/jira/browse/SPARK-9960 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.4.1 Environment: java version 1.7.0_71, Mac OS X 10.9.5 Reporter: Naga Labels: run-example -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi commented on SPARK-9213: -- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator
Tathagata Das created SPARK-9966: Summary: Handle a couple of corner cases in the PID rate estimator Key: SPARK-9966 URL: https://issues.apache.org/jira/browse/SPARK-9966 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker 1. The rate estimator should not estimate any rate when there are no records in the batch, as there is no data to estimate the rate. In the current state, it estimates and sets the rate to zero. That is incorrect. 2. The rate estimator should never set the rate to zero under any circumstances. Otherwise the system will stop receiving data, and stop generating useful estimates (see reason 1). So the fix is to define a parameter that sets a lower bound on the estimated rate, so that the system always receives some data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
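A hedged sketch of the two corner-case fixes (not the actual PIDRateEstimator code; the minRate parameter name is an assumption): skip estimation for empty batches and clamp the estimate to a configurable lower bound.
{code}
// Sketch only: return None (keep the previous rate) for empty batches and
// never let the estimated rate drop to zero.
def estimateRate(numElements: Long, processingDelayMs: Long, minRate: Double)
                (estimate: (Long, Long) => Double): Option[Double] = {
  if (numElements == 0 || processingDelayMs == 0) {
    None
  } else {
    Some(math.max(estimate(numElements, processingDelayMs), minRate))
  }
}
{code}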
[jira] [Created] (SPARK-9967) Rename the SparkConf property to spark.streaming.backpressure.{enable -- enabled}
Tathagata Das created SPARK-9967: Summary: Rename the SparkConf property to spark.streaming.backpressure.{enable -- enabled} Key: SPARK-9967 URL: https://issues.apache.org/jira/browse/SPARK-9967 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor ... to better align with most other spark parameters having enable in them... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696596#comment-14696596 ] Reynold Xin commented on SPARK-9213: We just need to handle it in the analyzer to rewrite it, and also pattern match it in the optimizer. No need to handle it everywhere else, since the analyzer will take care of the rewrite. Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
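A hedged sketch of the rewrite described above: a single analyzer rule swaps {{Like}} for a Java-regex fallback when configuration asks for it, so no other code path needs to check the option. {{LikeJavaFallback}} and the {{useJavaRegex}} flag are assumptions for illustration, not existing Spark classes or options.
{code}
import org.apache.spark.sql.catalyst.expressions.Like
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: rewrite Like -> LikeJavaFallback (hypothetical expression) in one place.
case class ResolveLikeImplementation(useJavaRegex: Boolean) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (!useJavaRegex) plan
    else plan transformAllExpressions {
      case Like(left, right) => LikeJavaFallback(left, right)
    }
}
{code}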
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696630#comment-14696630 ] Yadong Qi commented on SPARK-9213: -- I know, thanks! Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9958) HiveThriftServer2Listener is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9958: -- Assignee: Shixiong Zhu HiveThriftServer2Listener is not thread-safe Key: SPARK-9958 URL: https://issues.apache.org/jira/browse/SPARK-9958 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 HiveThriftServer2Listener should be thread-safe since it's accessed by multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9958) HiveThriftServer2Listener is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9958. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8185 [https://github.com/apache/spark/pull/8185] HiveThriftServer2Listener is not thread-safe Key: SPARK-9958 URL: https://issues.apache.org/jira/browse/SPARK-9958 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Fix For: 1.5.0 HiveThriftServer2Listener should be thread-safe since it's accessed by multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696640#comment-14696640 ] Apache Spark commented on SPARK-9961: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
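A hedged sketch of the proposed shape, using simplified traits rather than Spark's actual Predictor/PredictionModel hierarchy: each abstraction exposes an evaluator pre-configured with its natural metric and columns.
{code}
import org.apache.spark.ml.evaluation.{Evaluator, MulticlassClassificationEvaluator, RegressionEvaluator}

// Simplified stand-ins for the real abstractions; trait names are illustrative only.
trait HasDefaultEvaluator {
  def labelCol: String
  def predictionCol: String
  def defaultEvaluator: Evaluator
}

trait ClassifierLike extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new MulticlassClassificationEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol(predictionCol)
}

trait RegressorLike extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new RegressionEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol(predictionCol)
}
{code}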
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: (was: Apache Spark) ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696641#comment-14696641 ] Apache Spark commented on SPARK-9661: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 ML 1.5 QA: API: Java compatibility, docs Key: SPARK-9661 URL: https://issues.apache.org/jira/browse/SPARK-9661 Project: Spark Issue Type: Sub-task Components: Documentation, Java API, ML, MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Fix For: 1.5.0 Check Java compatibility for MLlib for this release. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) If you find issues, please comment here, or for larger items, create separate JIRAs and link here. Note that we should not break APIs from previous releases. So if you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: Apache Spark ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Apache Spark Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-9661: -- re-open for some clean-ups ML 1.5 QA: API: Java compatibility, docs Key: SPARK-9661 URL: https://issues.apache.org/jira/browse/SPARK-9661 Project: Spark Issue Type: Sub-task Components: Documentation, Java API, ML, MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Fix For: 1.5.0 Check Java compatibility for MLlib for this release. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) If you find issues, please comment here, or for larger items, create separate JIRAs and link here. Note that we should not break APIs from previous releases. So if you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9964) PySpark DataFrameReader accept RDD of String for JSON
Joseph K. Bradley created SPARK-9964: Summary: PySpark DataFrameReader accept RDD of String for JSON Key: SPARK-9964 URL: https://issues.apache.org/jira/browse/SPARK-9964 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Joseph K. Bradley Priority: Minor It would be nice (but not necessary) for the PySpark DataFrameReader to accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path. If this JIRA is accepted, it should probably be duplicated to cover the other input types (not just JSON). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
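For reference, the Scala counterpart that the request asks PySpark to mirror (a minimal sketch, assuming an active {{sc}} and {{sqlContext}}):
{code}
// Scala's DataFrameReader already accepts an RDD[String] of JSON objects
// in addition to a path.
val jsonRDD = sc.parallelize(Seq("""{"a": 1}""", """{"a": 2}"""))
val df = sqlContext.read.json(jsonRDD)
df.show()
{code}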
[jira] [Created] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
Joseph K. Bradley created SPARK-9963: Summary: ML RandomForest cleanup: replace predictNodeIndex with predictImpl Key: SPARK-9963 URL: https://issues.apache.org/jira/browse/SPARK-9963 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9965) Scala, Python SQLContext input methods' deprecation statuses do not match
Joseph K. Bradley created SPARK-9965: Summary: Scala, Python SQLContext input methods' deprecation statuses do not match Key: SPARK-9965 URL: https://issues.apache.org/jira/browse/SPARK-9965 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Joseph K. Bradley Priority: Minor Scala's SQLContext has several methods for data input (jsonFile, jsonRDD, etc.) deprecated. These methods are not deprecated in Python's SQLContext. They should be, but only after Python's DataFrameReader implements analogous methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi edited comment on SPARK-9213 at 8/14/15 6:34 AM: --- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? was (Author: waterman): [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-9968: - Description: When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. was:When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
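A hedged sketch of the locking structure described above, using Guava's RateLimiter instead of BlockGenerator's internals: the potentially long rate-limiter wait happens before the lock is taken, so the thread that swaps buffers and generates blocks only ever waits for a short append.
{code}
import com.google.common.util.concurrent.RateLimiter

// Sketch only, not the actual BlockGenerator fix.
class ThrottledBuffer(ratePerSec: Double) {
  private val limiter = RateLimiter.create(ratePerSec)
  private var buffer = Vector.empty[Any]

  def addData(data: Any): Unit = {
    limiter.acquire()                  // long wait happens outside the lock
    synchronized { buffer :+= data }   // short critical section
  }

  def swapBuffer(): Vector[Any] = synchronized {
    val old = buffer
    buffer = Vector.empty[Any]
    old
  }
}
{code}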
[jira] [Created] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
Tathagata Das created SPARK-9968: Summary: BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9961: - Comment: was deleted (was: User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190) ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
Joseph K. Bradley created SPARK-9961: Summary: ML prediction abstractions should have defaultEvaluator fields Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
Joseph K. Bradley created SPARK-9962: Summary: Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training Key: SPARK-9962 URL: https://issues.apache.org/jira/browse/SPARK-9962 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training. This applies to both the ML and MLlib implementations, but it is Ok to skip the MLlib implementation since it will eventually be replaced by the ML one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi edited comment on SPARK-9213 at 8/14/15 6:33 AM: --- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? was (Author: waterman): [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696574#comment-14696574 ] Reynold Xin commented on SPARK-9213: I'm thinking just have Like for Joni, and then LikeJavaFallback for Java. In the analyzer, we replace Like with LikeJavaFallback if the config option sets to java. Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696595#comment-14696595 ] Yadong Qi commented on SPARK-9213: -- [~rxin] There are many places that use Like. Wouldn't we have to check the config option everywhere? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696988#comment-14696988 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
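A hedged sketch of the rewrite on a toy boolean AST (not Catalyst's Expression classes, and ignoring negation): push ORs below ANDs by distributing Or over And.
{code}
sealed trait Expr
case class Pred(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Naive CNF conversion; can blow up exponentially, which is fine for a sketch.
def toCNF(e: Expr): Expr = e match {
  case And(l, r) => And(toCNF(l), toCNF(r))
  case Or(l, r) =>
    (toCNF(l), toCNF(r)) match {
      case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))
      case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case other => other
}

// (a AND b) OR c  ==>  (a OR c) AND (b OR c)
toCNF(Or(And(Pred("a"), Pred("b")), Pred("c")))
{code}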
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Apache Spark (was: Tathagata Das) BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696987#comment-14696987 ] Apache Spark commented on SPARK-9968: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8204 BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Tathagata Das (was: Apache Spark) BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC
[ https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697003#comment-14697003 ] Rama Mullapudi commented on SPARK-9734: --- Current 1.5 still gives error when creating table using dataframe write.jdbc Create statement issued by spark looks as below CREATE TABLE foo (TKT_GID DECIMAL(10},0}) NOT NULL) There are closing braces } in the decimal format which causing database to throw error. I looked into the code on github and found in jdbcutils class schemaString function has the extra closing braces } which is causing the issue. /** * Compute the schema string for this RDD. */ def schemaString(df: DataFrame, url: String): String = { . case BooleanType = BIT(1) case StringType = TEXT case BinaryType = BLOB case TimestampType = TIMESTAMP case DateType = DATE case t: DecimalType = sDECIMAL(${t.precision}},${t.scale}}) case _ = throw new IllegalArgumentException(sDon't know how to save $field to JDBC) }) } java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC - Key: SPARK-9734 URL: https://issues.apache.org/jira/browse/SPARK-9734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Greg Rahn Assignee: Davies Liu Fix For: 1.5.0 When using a basic example of reading the EMP table from Redshift via spark-redshift, and writing the data back to Redshift, Spark fails with the below error, related to Numeric/Decimal data types. Redshift table: {code} testdb=# \d emp Table public.emp Column | Type | Modifiers --+---+--- empno| integer | ename| character varying(10) | job | character varying(9) | mgr | integer | hiredate | date | sal | numeric(7,2) | comm | numeric(7,2) | deptno | integer | testdb=# select * from emp; empno | ename |job| mgr | hiredate | sal | comm | deptno ---++---+--++-+-+ 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 |NULL | 20 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 |NULL | 10 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 |NULL | 20 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 |NULL | 20 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 |NULL | 30 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 |NULL | 20 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 |0.00 | 30 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 |NULL | 30 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10 (14 rows) {code} Spark Code: {code} val url = jdbc:redshift://rshost:5439/testdb?user=xxxpassword=xxx val driver = com.amazon.redshift.jdbc41.Driver val t = sqlContext.read.format(com.databricks.spark.redshift).option(jdbcdriver, driver).option(url, url).option(dbtable, emp).option(tempdir, s3n://spark-temp-dir).load() t.registerTempTable(SparkTempTable) val t1 = sqlContext.sql(select * from SparkTempTable) t1.write.format(com.databricks.spark.redshift).option(driver, driver).option(url, url).option(dbtable, t1).option(tempdir, s3n://spark-temp-dir).option(avrocompression, snappy).mode(error).save() {code} Error Stack: {code} java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC at 
org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132) at
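For reference, a minimal standalone sketch (not the actual Spark patch) of how the stray braces in the interpolated string end up in the generated DDL, and what the corrected interpolation looks like; DecimalType below is a local stand-in, not org.apache.spark.sql.types.DecimalType:
{code}
// Stand-in type for illustration only.
case class DecimalType(precision: Int, scale: Int)

object DecimalDdlSketch {
  // Buggy form reported above: an extra '}' follows each interpolated expression.
  def buggy(t: DecimalType): String = s"DECIMAL(${t.precision}},${t.scale}})"
  // Corrected form: one closing brace per interpolation.
  def fixed(t: DecimalType): String = s"DECIMAL(${t.precision},${t.scale})"

  def main(args: Array[String]): Unit = {
    val t = DecimalType(10, 0)
    println(buggy(t)) // DECIMAL(10},0})  -> rejected by the database
    println(fixed(t)) // DECIMAL(10,0)    -> valid DDL
  }
}
{code}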
[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer
Kai Sasaki created SPARK-9977: - Summary: The usage of a label generated by StringIndexer Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial By using {{StringIndexer}}, we can obtain an indexed label in a new column. A following estimator in the pipeline should then use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
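For illustration, a minimal sketch of the kind of snippet the ML guide could include — the column names ("category", "categoryIndex", "features") and the LogisticRegression stage are illustrative assumptions, not taken from the issue:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer writes the indexed label into a new column ("categoryIndex" here);
// a downstream estimator has to be pointed at that column explicitly.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val lr = new LogisticRegression()
  .setLabelCol("categoryIndex") // use the indexed label, not the original string column
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, lr))
// val model = pipeline.fit(trainingDf) // trainingDf is assumed to contain "category" and "features" columns
{code}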
[jira] [Updated] (SPARK-9976) create function does not work
[ https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-9976: - Description: I use beeline to connect to the ThriftServer, but add jar does not work, so I use create function instead; see the link below. http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html I do as below: create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; It returns OK, and when I connect to the metastore I see records in the FUNCS table. select gdecodeorder(t1) from tableX limit 1; It returns the error 'Couldn't find function default.gdecodeorder'. This is the exception: 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: Couldn't find function default.gdecodeorder 15/08/14 15:04:47 ERROR RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function default.t_gdecodeorder does not exist)) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740) at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy21.get_function(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721) at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy22.getFunction(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54) at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44) at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
[jira] [Created] (SPARK-9978) Window functions require partitionBy to work as expected
Maciej Szymkiewicz created SPARK-9978: - Summary: Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce the following SQL query: {code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} using the PySpark API. Unfortunately it doesn't work as expected: {code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+---+
|   x| rn|
+----+---+
| 0.5|  1|
|0.25|  1|
|0.75|  1|
+----+---+
{code} As a workaround it is possible to call partitionBy without additional arguments: {code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect the Scala API: {code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-9978: -- Description: I am trying to reproduce following SQL query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} was: I am trying to reproduce following query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce following SQL query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. 
Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696677#comment-14696677 ] ding commented on SPARK-5556: - The code can be found at https://github.com/intel-analytics/TopicModeling. There is an example in the package; you can try Gibbs sampling LDA or online LDA by setting --optimizer to gibbs or online. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9725) spark sql query string field returns empty/garbled string
[ https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696684#comment-14696684 ] Zhongshuai Pei commented on SPARK-9725: --- In my environment, offset did not change along with BYTE_ARRAY_OFFSET. When spark.executor.memory=36g, BYTE_ARRAY_OFFSET is 24 and offset is 16, but when spark.executor.memory=30g, BYTE_ARRAY_OFFSET is 16 and offset is 16 too. So if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[] && ((byte[]) base).length == numBytes) will get different results. [~rxin] spark sql query string field returns empty/garbled string Key: SPARK-9725 URL: https://issues.apache.org/jira/browse/SPARK-9725 Project: Spark Issue Type: Bug Components: SQL Reporter: Fei Wang Assignee: Davies Liu Priority: Blocker To reproduce it: 1. deploy spark in cluster mode (I use standalone mode locally) 2. set executor memory = 32g; set the following config in spark-default.xml: spark.executor.memory 36g 3. run spark-sql.sh with show tables; it returns empty/garbled strings -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
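As a side note, the 24-vs-16 difference reported above is consistent with compressed oops being disabled once the executor heap exceeds roughly 32 GB, which changes the JVM's byte[] base offset. A small hedged sketch (plain JVM code, not Spark's internal Platform class) to check the value on a given JVM:
{code}
import java.lang.reflect.Field

// Prints the byte[] base offset that constants like BYTE_ARRAY_OFFSET are derived from.
// Typically 16 with compressed oops (heaps up to ~32 GB) and 24 without (e.g. -Xmx36g),
// matching the 16/24 values observed in the comment above.
object ByteArrayOffsetCheck {
  def main(args: Array[String]): Unit = {
    val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    val unsafe = f.get(null).asInstanceOf[sun.misc.Unsafe]
    println(unsafe.arrayBaseOffset(classOf[Array[Byte]]))
  }
}
{code}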
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): print Joining tables %s %s % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) ||
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Cheng Lian (was: Apache Spark) JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
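The "different boolean" presumably refers to the overwrite flag on Hadoop's FileSystem.create. A hedged sketch of the two behaviors (the path is made up; this is not the Spark writer code itself):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// create(path, true) silently replaces a file left behind by a failed or speculative
// attempt; create(path, false) throws "File already exists" instead.
val fs = FileSystem.get(new Configuration())
val path = new Path("/tmp/spark-9899-demo/part-00000")

fs.create(path, true).close()                 // first attempt (or overwrite): succeeds
println(Try(fs.create(path, false).close()))  // retry without overwrite: Failure(... already exists)
println(Try(fs.create(path, true).close()))   // retry with overwrite: Success(())
{code}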
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696672#comment-14696672 ] Apache Spark commented on SPARK-9899: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8191 JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Apache Spark (was: Cheng Lian) JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
Maciej Bryński created SPARK-9970: - Summary: SQLContext.createDataFrame failed to properly determine column names Key: SPARK-9970 URL: https://issues.apache.org/jira/browse/SPARK-9970 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Maciej Bryński Priority: Minor Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): print Joining tables %s %s % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) I tried to check if groupBy isn't the root of the problem but it's looks right grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the cause of the problem but it looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable =
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696706#comment-14696706 ] Yu Ishikawa commented on SPARK-9871: I think it would be better to deal with {{struct}} in another issue, since {{collect}} doesn't work with a DataFrame which has a column of struct type; we need to improve the {{collect}} method first. When I tried to implement the {{struct}} function, a struct column converted by {{dfToCols}} consists of {{jobj}} values: {noformat} List of 1 $ structed:List of 2 ..$ :Class 'jobj' <environment: 0x7fd46efe4e68> ..$ :Class 'jobj' <environment: 0x7fd46efee078> {noformat} Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
Frank Rosner created SPARK-9971: --- Summary: MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.functions.max import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9972) Add `struct` function in SparkR
Yu Ishikawa created SPARK-9972: -- Summary: Add `struct` function in SparkR Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support {{struct}} function on a DataFrame in SparkR. However, I think we need to improve {{collect}} function in SparkR in order to implement {{struct}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9973) wrong buffer size
xukun created SPARK-9973: Summary: wrong buffer size Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696717#comment-14696717 ] Sean Owen commented on SPARK-9973: -- [~xukun] This would be more useful if you added a descriptive title and a more complete description. What code is affected? What buffer? You explained this in the PR to some degree, so a brief note about the problem to be solved is appropriate. Right now someone reading this doesn't know what it refers to. wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The allocated buffer size is wrong (4 + size * columnType.defaultSize * columnType.defaultSize); the right buffer size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
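To make the description concrete, a small worked example (the column type and row count are made up): for an INT column with defaultSize = 4 and size = 1000 values, the two formulas give very different allocations:
{code}
val size = 1000
val defaultSize = 4                               // e.g. an INT column
val wrong = 4 + size * defaultSize * defaultSize  // 16004 bytes: defaultSize applied twice
val right = 4 + size * defaultSize                //  4004 bytes: 4-byte header + one slot per value
println((wrong, right))                           // (16004,4004)
{code}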
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696716#comment-14696716 ] Sean Owen commented on SPARK-9971: -- My instinct is that this should in fact result in NaN; NaN is not in general ignored. For example {{math.max(1.0, Double.NaN)}} and {{math.min(1.0, Double.NaN)}} are both NaN. But are you ready for some weird? {code} scala Seq(1.0, Double.NaN).max res23: Double = NaN scala Seq(Double.NaN, 1.0).max res24: Double = 1.0 scala Seq(5.0, Double.NaN, 1.0).max res25: Double = 1.0 scala Seq(5.0, Double.NaN, 1.0, 6.0).max res26: Double = 6.0 scala Seq(5.0, Double.NaN, 1.0, 6.0, Double.NaN).max res27: Double = NaN {code} Not sure what to make of that, other than the Scala collection library isn't a good reference for behavior. Java? {code} scala java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(1.0), new java.lang.Double(Double.NaN))) res33: Double = NaN scala java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(Double.NaN), new java.lang.Double(1.0))) res34: Double = NaN {code} Makes more sense at least. I think this is correct behavior and you would filter NaN if you want to ignore them, as it is generally not something the language ignores. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696720#comment-14696720 ] Apache Spark commented on SPARK-9871: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/8194 Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: (was: Apache Spark) Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: Apache Spark Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
Saisai Shao created SPARK-9969: -- Summary: Remove old Yarn MR classpath api support for Spark Yarn client Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696681#comment-14696681 ] Apache Spark commented on SPARK-9969: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8192 Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: Apache Spark Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Assignee: Apache Spark Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: (was: Apache Spark) Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696700#comment-14696700 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
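For intuition, a toy sketch of the distribution step behind CNF conversion (this is not Catalyst code; it only handles And/Or over opaque leaves and ignores Not): a filter like (a && b) || c becomes (a || c) && (b || c), and each top-level conjunct can then be pushed to the data source on its own.
{code}
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Recursively distribute Or over And until no Or has an And beneath it.
def toCnf(e: Expr): Expr = e match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) =>
    (toCnf(l), toCnf(r)) match {
      case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
      case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case leaf => leaf
}

// (a && b) || c  ==>  (a || c) && (b || c)
println(toCnf(Or(And(Leaf("a"), Leaf("b")), Leaf("c"))))
// And(Or(Leaf(a),Leaf(c)),Or(Leaf(b),Leaf(c)))
{code}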
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: (was: Apache Spark) Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: Apache Spark Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Priority: Minor (was: Major) MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Description: h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. was: h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun updated SPARK-9973: - Description: The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
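A minimal sketch of the allocation fix described in this ticket; the method and parameter names below are illustrative only (taken from the wording of the report), not Spark's actual columnar-buffer code.
{code}
import java.nio.ByteBuffer

// Hypothetical columnar-buffer allocation illustrating the reported fix.
// `size` is the expected value count and `defaultSize` the per-value width in bytes.
def allocate(size: Int, defaultSize: Int): ByteBuffer = {
  // Wrong: multiplying the per-value width in twice over-allocates quadratically.
  // ByteBuffer.allocate(4 + size * defaultSize * defaultSize)

  // Right: a 4-byte header plus one defaultSize-wide slot per value.
  ByteBuffer.allocate(4 + size * defaultSize)
}
{code}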
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696714#comment-14696714 ] Apache Spark commented on SPARK-9973: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/8189 wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: (was: Apache Spark) wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: Apache Spark wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun Assignee: Apache Spark The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696760#comment-14696760 ] Frank Rosner commented on SPARK-9971: - I would like to provide a patch to make the following unit tests in {{DataFrameFunctionsSuite}} succeed: {code} test("max function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(max("col1")), Seq(Row(10d)) ) checkAnswer( df.select(max("col2")), Seq(Row(Double.NaN)) ) } test("min function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(min("col1")), Seq(Row(-10d)) ) checkAnswer( df.select(min("col2")), Seq(Row(Double.NaN)) ) } {code} MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, as will {{x < Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ import org.apache.spark.sql.functions.max val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
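To make the intent of these tests concrete, here is an illustrative NaN-ignoring accumulator; the names are hypothetical and this is not Spark's internal MaxFunction, only a sketch of the update rule the tests expect.
{code}
// Illustrative accumulator: NaN inputs are skipped; NaN is reported only when
// no non-NaN value was ever seen (matching the expected results above).
final class NaNIgnoringMax {
  private var seen = false
  private var runningMax = Double.NaN

  def update(value: Double): Unit = {
    if (!value.isNaN && (!seen || value > runningMax)) {
      runningMax = value
      seen = true
    }
  }

  def result: Double = if (seen) runningMax else Double.NaN
}
{code}
Feeding it col1's values (NaN, -10, 10) yields 10.0, while col2's all-NaN values yield NaN, matching the checkAnswer expectations above.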
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697098#comment-14697098 ] Cody Koeninger commented on SPARK-9947: --- You already have access to offsets and can save them however you want. You can provide those offsets on restart, regardless of whether checkpointing was enabled. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
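For context, the offset handling Cody refers to is available on the direct stream itself. The sketch below assumes the Spark 1.3+ Kafka direct API; saveOffsets and loadOffsets are hypothetical placeholders for whatever external store the application chooses.
{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Hypothetical helpers: persist offsets to any external store (HDFS, a database, ZK, ...).
def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = ???
def loadOffsets(): Map[TopicAndPartition, Long] = ???

def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
  val fromOffsets = loadOffsets() // offsets recorded by a previous run
  val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    // Offsets are exposed on every batch, whether or not checkpointing is enabled.
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd, then record where this batch ended ...
    saveOffsets(ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
  }
  stream
}
{code}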
[jira] [Created] (SPARK-9979) Unable to request addition to powered by spark
Jeff Palmucci created SPARK-9979: Summary: Unable to request addition to powered by spark Key: SPARK-9979 URL: https://issues.apache.org/jira/browse/SPARK-9979 Project: Spark Issue Type: Bug Components: Documentation Reporter: Jeff Palmucci Priority: Minor The Powered By Spark page asks users to submit new listing requests to u...@apache.spark.org. However, when I do that, I get: {quote} Hi. This is the qmail-send program at apache.org. I'm afraid I wasn't able to deliver your message to the following addresses. This is a permanent error; I've given up. Sorry it didn't work out. u...@spark.apache.org: Must be sent from an @apache.org address or a subscriber address or an address in LDAP. {quote} The project I wanted to list is here: http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697157#comment-14697157 ] Dan Dutrow commented on SPARK-9947: --- The desire is to continue using checkpointing for everything but allow selective deleting of certain types of checkpoint data. Using an external database to duplicate that functionality is not desired. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697166#comment-14697166 ] Cody Koeninger commented on SPARK-9947: --- You can't re-use checkpoint data across application upgrades anyway, so that honestly seems kind of pointless. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: Apache Spark The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Assignee: Apache Spark Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: (was: Apache Spark) The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697104#comment-14697104 ] Apache Spark commented on SPARK-9977: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/8205 The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697180#comment-14697180 ] Dan Dutrow commented on SPARK-9947: --- I want to maintain the data in updateStateByKey between upgrades. This data can be recovered between upgrades so long as objects don't change. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697193#comment-14697193 ] Cody Koeninger commented on SPARK-9947: --- Didn't you already say that you were saving updateStateByKey state yourself? Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697203#comment-14697203 ] Cheng Lian commented on SPARK-8118: --- Unfortunately no. Turn off noisy log output produced by Parquet 1.7.0 --- Key: SPARK-8118 URL: https://issues.apache.org/jira/browse/SPARK-8118 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 Parquet 1.7.0 renames its package to org.apache.parquet, so {{ParquetRelation.enableLogForwarding}} needs to be adjusted accordingly to avoid noisy log output. A better approach than simply muting these log lines is to redirect Parquet logs via SLF4J, so that we can handle them consistently. In general these logs are very useful, especially when diagnosing Parquet memory issues and filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6249) Get Kafka offsets from consumer group in ZK when using direct stream
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697188#comment-14697188 ] Cody Koeninger commented on SPARK-6249: --- If you want an API that has imprecise semantics and stores offsets in ZK, use the receiver-based stream. This ticket has been closed for a while; I'd suggest further discussion would be better suited for the mailing list. Get Kafka offsets from consumer group in ZK when using direct stream Key: SPARK-6249 URL: https://issues.apache.org/jira/browse/SPARK-6249 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das This is the proposal. The simpler direct API (the one that does not take explicit offsets) can be modified to also pick up the initial offset from ZK if group.id is specified. This is exactly analogous to how we find the latest or earliest offset in that API, except that instead of the latest/earliest offset of the topic we want to find the offset for the consumer group. The group offsets in ZK are not used at all for any further processing and restarting, so the exactly-once semantics are not broken. The use case where this is useful is simplified code upgrade. If the user wants to upgrade the code, he/she can stop the context gracefully, which will ensure the ZK consumer group offset is updated with the last offsets processed. Then the new code, started fresh (not restarted from checkpoint), can pick up the consumer group offset from ZK and continue where the previous code left off. Without the functionality of picking up consumer group offsets to start from (that is, currently), the only way to do this is for users to save the offsets somewhere (file, database, etc.) and manage the offsets themselves. I just want to simplify this process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697220#comment-14697220 ] Dan Dutrow commented on SPARK-9947: --- Ok, right, we're saving the same state that's outputted from the updateStateByKey to HDFS. The thought is that maybe updateStateByKey is saving the exact same data in the checkpoint as I have to do in my own function. Allowing separation of the different types of data stored in the checkpoint data might allow me to not have to save the same state data again. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
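One way to express the workaround described in this thread is the {{updateStateByKey}} overload that accepts an initial state RDD (available since Spark 1.3). The sketch below is illustrative only; the state path and the latestSnapshotPath helper are assumptions, not Spark APIs.
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Assumption: state snapshots are written to a durable location outside the checkpoint dir.
val statePath = "hdfs:///app/state"

// Hypothetical helper: locate the most recent snapshot written by a previous run.
def latestSnapshotPath(base: String): String = ???

def withRecoveredState(ssc: StreamingContext, events: DStream[(String, Long)]): DStream[(String, Long)] = {
  // Example state: a running count per key.
  val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
    (values, state) => Some(state.getOrElse(0L) + values.sum)

  // Seed the state from the snapshot left by the previous deployment.
  val initialState = ssc.sparkContext.objectFile[(String, Long)](latestSnapshotPath(statePath))

  val stateStream = events.updateStateByKey(
    updateFunc,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialState)

  // Persist the latest state in addition to the checkpoint, one directory per batch,
  // so it survives deleting the checkpoint directory on upgrade.
  stateStream.foreachRDD((rdd, time) => rdd.saveAsObjectFile(s"$statePath/${time.milliseconds}"))
  stateStream
}
{code}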
[jira] [Updated] (SPARK-9972) Add `struct` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-9972: --- Target Version/s: 1.6.0 (was: 1.5.0) Add `struct` function in SparkR --- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support the {{struct}} function on a DataFrame in SparkR. However, I think we need to improve the {{collect}} function in SparkR in order to implement {{struct}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9972) Add `struct` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697232#comment-14697232 ] Shivaram Venkataraman commented on SPARK-9972: -- Yeah, this can be marked as being blocked by https://issues.apache.org/jira/browse/SPARK-6819 Add `struct` function in SparkR --- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support the {{struct}} function on a DataFrame in SparkR. However, I think we need to improve the {{collect}} function in SparkR in order to implement {{struct}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9946) NPE in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9946. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8177 [https://github.com/apache/spark/pull/8177] NPE in TaskMemoryManager Key: SPARK-9946 URL: https://issues.apache.org/jira/browse/SPARK-9946 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.5.0 {code} Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 6 in stage 6801.0 failed 1 times, most recent failure: Lost task 6.0 in stage 6801.0 (TID 41123, localhost): java.lang.NullPointerException [info]at org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) [info]at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:165) [info]at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:142) [info]at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:129) [info]at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) [info]at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:302) [info]at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:218) [info]at org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:110) [info]at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) [info]at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) [info]at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) [info]at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99) [info]at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) [info]at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) [info]at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info]at org.apache.spark.scheduler.Task.run(Task.scala:88) [info]at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 13:27:17.435 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 23.0 in stage 6801.0 (TID 41140, localhost): TaskKilled (killed intentionally) [info]at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info]at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 13:27:17.436 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.0 in stage 6801.0 (TID 41129, localhost): TaskKilled (killed intentionally) [info]at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9828) Should not share `{}` among instances
[ https://issues.apache.org/jira/browse/SPARK-9828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9828. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8110 [https://github.com/apache/spark/pull/8110] Should not share `{}` among instances - Key: SPARK-9828 URL: https://issues.apache.org/jira/browse/SPARK-9828 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Fix For: 1.5.0 We use `{}` as the initial value in some places, e.g., https://github.com/apache/spark/blob/master/python/pyspark/ml/param/__init__.py#L64. This makes instances share the same param map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9984) Create local physical operator interface
Reynold Xin created SPARK-9984: -- Summary: Create local physical operator interface Key: SPARK-9984 URL: https://issues.apache.org/jira/browse/SPARK-9984 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin The local operator interface should be similar to traditional database iterators, with open/close. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
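Since the ticket only states the requirement, the trait below is a hypothetical sketch of an open/next/close iterator-style local operator in the spirit of the classic Volcano model; it is not Spark's actual interface.
{code}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical iterator-style local operator in the open/next/close tradition.
trait LocalNode {
  /** Acquire resources and prepare any child operators. */
  def open(): Unit

  /** Advance to the next row; returns false once the input is exhausted. */
  def next(): Boolean

  /** The row produced by the most recent successful call to next(). */
  def fetch(): InternalRow

  /** Release all resources. */
  def close(): Unit
}
{code}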
[jira] [Updated] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9985: Assignee: Shixiong Zhu DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall: every time I try to use the PostgreSQL JDBC driver I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
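A commonly suggested workaround, sketched below under the assumption that the PostgreSQL driver jar is already on the driver and executor classpaths: register the driver class explicitly and pass connection properties directly to {{jdbc}}. Whether this is sufficient depends on how the jar is distributed; the URL, table name, and credentials are placeholders.
{code}
import java.util.Properties
import org.apache.spark.sql.DataFrame

def writeToPostgres(df: DataFrame): Unit = {
  // Register the driver with java.sql.DriverManager up front instead of relying
  // on options set on the reader/writer being propagated.
  Class.forName("org.postgresql.Driver")

  val props = new Properties()
  props.setProperty("user", "spark")       // placeholder credentials
  props.setProperty("password", "secret")
  props.setProperty("driver", "org.postgresql.Driver")

  // Placeholder URL and table name.
  df.write.jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)
}
{code}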
[jira] [Resolved] (SPARK-9828) Should not share `{}` among instances
[ https://issues.apache.org/jira/browse/SPARK-9828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9828. -- Resolution: Fixed Fix Version/s: 1.4.2 Should not share `{}` among instances - Key: SPARK-9828 URL: https://issues.apache.org/jira/browse/SPARK-9828 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Fix For: 1.4.2, 1.5.0 We use `{}` as the initial value in some places, e.g., https://github.com/apache/spark/blob/master/python/pyspark/ml/param/__init__.py#L64. This makes instances share the same param map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9986) Create a simple test framework for local operators
Reynold Xin created SPARK-9986: -- Summary: Create a simple test framework for local operators Key: SPARK-9986 URL: https://issues.apache.org/jira/browse/SPARK-9986 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin It'd be great if we could just create local query plans and test the correctness of their implementation directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9994) Create local TopK operator
Reynold Xin created SPARK-9994: -- Summary: Create local TopK operator Key: SPARK-9994 URL: https://issues.apache.org/jira/browse/SPARK-9994 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Similar to the existing TakeOrderedAndProject, except in a single thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
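As a point of reference, the single-threaded behavior described here can be sketched with a bounded priority queue; the code below is illustrative only and does not mirror TakeOrderedAndProject's implementation.
{code}
import scala.collection.mutable

// Illustrative single-threaded top-K: keep at most k rows in a heap that
// evicts the smallest of the currently retained rows first.
def topK[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
  require(k > 0, "k must be positive")
  val heap = mutable.PriorityQueue.empty[T](ord.reverse) // reversed ordering => min-heap
  input.foreach { row =>
    if (heap.size < k) {
      heap.enqueue(row)
    } else if (ord.gt(row, heap.head)) { // row beats the smallest retained element
      heap.dequeue()
      heap.enqueue(row)
    }
  }
  heap.dequeueAll.reverse // ascending under ord -> reversed for largest-first output
}
{code}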
[jira] [Created] (SPARK-9993) Create local union operator
Reynold Xin created SPARK-9993: -- Summary: Create local union operator Key: SPARK-9993 URL: https://issues.apache.org/jira/browse/SPARK-9993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8
[ https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9589: --- Assignee: Apache Spark (was: Josh Rosen) Flaky test: HiveCompatibilitySuite.groupby8 --- Key: SPARK-9589 URL: https://issues.apache.org/jira/browse/SPARK-9589 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/ {code} sbt.ForkMain$ForkError: Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, localhost): java.lang.NullPointerException at org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) at org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366) at org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600) at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.init(UnsafeHybridAggregationIterator.scala:290) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358) at org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130) at org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9561) Enable BroadcastJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9561: --- Assignee: Andrew Or (was: Apache Spark) Enable BroadcastJoinSuite - Key: SPARK-9561 URL: https://issues.apache.org/jira/browse/SPARK-9561 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is introduced in SPARK-8735, but due to complexities with TestSQLContext this needs to be commented out completely. We need to find a way to make it work before releasing 1.5. For more detail, see the discussion: https://github.com/apache/spark/pull/7770 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9561) Enable BroadcastJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697425#comment-14697425 ] Apache Spark commented on SPARK-9561: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8208 Enable BroadcastJoinSuite - Key: SPARK-9561 URL: https://issues.apache.org/jira/browse/SPARK-9561 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is introduced in SPARK-8735, but due to complexities with TestSQLContext this needs to be commented out completely. We need to find a way to make it work before releasing 1.5. For more detail, see the discussion: https://github.com/apache/spark/pull/7770 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org