[jira] [Closed] (SPARK-9826) Cannot use custom classes in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay closed SPARK-9826. --- Verified working with a local build of branch-1.4 (1.4.2-snapshot) Cannot use custom classes in log4j.properties - Key: SPARK-9826 URL: https://issues.apache.org/jira/browse/SPARK-9826 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Michel Lemay Assignee: Michel Lemay Priority: Minor Labels: regression Fix For: 1.4.2, 1.5.0 log4j is initialized before spark class loader is set on the thread context. Therefore it cannot use classes embedded in fat-jars submitted to spark. While parsing arguments, spark calls methods on Utils class and triggers ShutdownHookManager static initialization. This then leads to log4j being initialized before spark gets the chance to specify custom class MutableURLClassLoader on the thread context. See detailed explanation here: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
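For context, a minimal sketch of the failure mode (the package and class names are hypothetical, not from the ticket): a custom appender that lives only inside the submitted fat jar. Because log4j initializes before Spark installs MutableURLClassLoader as the thread context class loader, log4j cannot resolve such a class and fails with ClassNotFoundException.
{code}
// Hypothetical custom appender shipped only inside the application fat jar.
// log4j.properties would reference it as:
//   log4j.appender.json=com.example.logging.JsonAppender
// If log4j initializes before Spark sets MutableURLClassLoader as the thread
// context class loader, resolving this class fails with ClassNotFoundException.
package com.example.logging

import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent

class JsonAppender extends AppenderSkeleton {
  override def append(event: LoggingEvent): Unit =
    System.err.println(s"""{"level":"${event.getLevel}","msg":"${event.getRenderedMessage}"}""")
  override def close(): Unit = ()
  override def requiresLayout(): Boolean = false
}
{code}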
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696889#comment-14696889 ] Frank Rosner commented on SPARK-9971: - Ok, so shall we close it as a wontfix? MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor
h4. Problem Description
When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}.
h4. How to Reproduce
{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}
h4. Solution
The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
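A quick illustration of the comparison semantics behind the report, plus the ignore-NaN behaviour the ticket proposes, sketched on plain Scala collections rather than as a patch to the SQL MaxFunction:
{code}
// IEEE 754: every ordered comparison against NaN is false, so a running maximum
// that ever becomes NaN can never be replaced by a real number again.
Double.NaN > 1.0   // false
1.0 > Double.NaN   // false

// Proposed semantics on plain doubles: ignore NaN; if only NaN values remain,
// the maximum is undefined.
def maxIgnoringNaN(xs: Seq[Double]): Option[Double] = {
  val clean = xs.filterNot(_.isNaN)
  if (clean.isEmpty) None else Some(clean.max)
}

maxIgnoringNaN(Seq(Double.NaN, -10d, 0d))  // Some(0.0)
maxIgnoringNaN(Seq(Double.NaN))            // None
{code}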
[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696891#comment-14696891 ] Apache Spark commented on SPARK-9906: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/8197 User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Manoj Kumar (was: Apache Spark) User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9906: --- Assignee: Apache Spark (was: Manoj Kumar) User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Apache Spark SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696894#comment-14696894 ] Sean Owen commented on SPARK-9971: -- I'd always prefer to leave it open at least ~1 week to see if anyone else has comments. If not, yes I personally would argue that the current behavior is the right one. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor
h4. Problem Description
When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}.
h4. How to Reproduce
{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}
h4. Solution
The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696919#comment-14696919 ] Apache Spark commented on SPARK-9974: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8198 SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
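A rough way to check whether a given assembly is affected, loosely mirroring the reflective lookup Hive performs in Table.getOutputFormatClass (paste into a spark-shell started against the assembly under test):
{code}
// Probe for the old-namespace bundle class that Hive resolves reflectively.
// On an SBT-built assembly exhibiting this bug, the catch branch is taken.
try {
  Class.forName("parquet.hadoop.api.WriteSupport")
  println("com.twitter:parquet-hadoop-bundle classes are present")
} catch {
  case _: ClassNotFoundException =>
    println("parquet.hadoop.api.WriteSupport is missing from the assembly")
}
{code}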
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: (was: Apache Spark) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298) at
[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9974: --- Assignee: Apache Spark SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Blocker One of the consequence of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. Maven build is OK. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly ... $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class {noformat} Only classes of {{org.apache.parquet:parquet-mr:1.7.0}}. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for {{parquet.hadoop.api.WriteSupport}}: {noformat} $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package $ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api org/apache/parquet/hadoop/api/ org/apache/parquet/hadoop/api/DelegatingReadSupport.class org/apache/parquet/hadoop/api/DelegatingWriteSupport.class org/apache/parquet/hadoop/api/InitContext.class org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class org/apache/parquet/hadoop/api/ReadSupport.class org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class org/apache/parquet/hadoop/api/WriteSupport.class parquet/hadoop/api/ parquet/hadoop/api/DelegatingReadSupport.class parquet/hadoop/api/DelegatingWriteSupport.class parquet/hadoop/api/InitContext.class parquet/hadoop/api/ReadSupport$ReadContext.class parquet/hadoop/api/ReadSupport.class parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class parquet/hadoop/api/WriteSupport$WriteContext.class parquet/hadoop/api/WriteSupport.class {noformat} Expected classes are packaged properly. 
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1): {noformat} hive CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1; {noformat} Build Spark assembly jar with the SBT command above, start {{spark-shell}}: {noformat} scala sqlContext.table(parquet_test).show() 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test 15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at
[jira] [Commented] (SPARK-9960) run-example SparkPi fails on Mac
[ https://issues.apache.org/jira/browse/SPARK-9960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696546#comment-14696546 ] Apache Spark commented on SPARK-9960: - User 'farseer90718' has created a pull request for this issue: https://github.com/apache/spark/pull/8188 run-example SparkPi fails on Mac Key: SPARK-9960 URL: https://issues.apache.org/jira/browse/SPARK-9960 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.4.1 Environment: java version 1.7.0_71, Mac OS X 10.9.5 Reporter: Naga Labels: run-example -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi commented on SPARK-9213: -- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator
Tathagata Das created SPARK-9966: Summary: Handle a couple of corner cases in the PID rate estimator Key: SPARK-9966 URL: https://issues.apache.org/jira/browse/SPARK-9966 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker 1. The rate estimator should not estimate any rate when there are no records in the batch, as there is no data to estimate the rate. In the current state, it estimates and sets the rate to zero. That is incorrect. 2. The rate estimator should never set the rate to zero under any circumstances. Otherwise the system will stop receiving data, and stop generating useful estimates (see reason 1). So the fix is to define a parameter that sets a lower bound on the estimated rate, so that the system always receives some data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
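A hedged sketch of the two corner-case fixes (not the actual PIDRateEstimator code; the minRate parameter name is an assumption): skip estimation for empty batches and clamp the estimate to a configurable lower bound.
{code}
// Sketch only: return None (keep the previous rate) for empty batches and
// never let the estimated rate drop to zero.
def estimateRate(numElements: Long, processingDelayMs: Long, minRate: Double)
                (estimate: (Long, Long) => Double): Option[Double] = {
  if (numElements == 0 || processingDelayMs == 0) {
    None
  } else {
    Some(math.max(estimate(numElements, processingDelayMs), minRate))
  }
}
{code}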
[jira] [Created] (SPARK-9967) Rename the SparkConf property to spark.streaming.backpressure.{enable -- enabled}
Tathagata Das created SPARK-9967: Summary: Rename the SparkConf property to spark.streaming.backpressure.{enable -- enabled} Key: SPARK-9967 URL: https://issues.apache.org/jira/browse/SPARK-9967 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor ... to better align with most other spark parameters having enable in them... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696596#comment-14696596 ] Reynold Xin commented on SPARK-9213: We just need to handle it in the analyzer to rewrite it, and also pattern match it in the optimizer. No need to handle it everywhere else, since the analyzer will take care of the rewrite. Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
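A hedged sketch of the rewrite described above: a single analyzer rule swaps {{Like}} for a Java-regex fallback when configuration asks for it, so no other code path needs to check the option. {{LikeJavaFallback}} and the {{useJavaRegex}} flag are assumptions for illustration, not existing Spark classes or options.
{code}
import org.apache.spark.sql.catalyst.expressions.Like
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: rewrite Like -> LikeJavaFallback (hypothetical expression) in one place.
case class ResolveLikeImplementation(useJavaRegex: Boolean) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (!useJavaRegex) plan
    else plan transformAllExpressions {
      case Like(left, right) => LikeJavaFallback(left, right)
    }
}
{code}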
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696630#comment-14696630 ] Yadong Qi commented on SPARK-9213: -- I know, thanks! Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9958) HiveThriftServer2Listener is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9958: -- Assignee: Shixiong Zhu HiveThriftServer2Listener is not thread-safe Key: SPARK-9958 URL: https://issues.apache.org/jira/browse/SPARK-9958 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 HiveThriftServer2Listener should be thread-safe since it's accessed by multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9958) HiveThriftServer2Listener is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9958. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8185 [https://github.com/apache/spark/pull/8185] HiveThriftServer2Listener is not thread-safe Key: SPARK-9958 URL: https://issues.apache.org/jira/browse/SPARK-9958 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Fix For: 1.5.0 HiveThriftServer2Listener should be thread-safe since it's accessed by multiple threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696640#comment-14696640 ] Apache Spark commented on SPARK-9961: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
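A hedged sketch of the proposed shape, using simplified traits rather than Spark's actual Predictor/PredictionModel hierarchy: each abstraction exposes an evaluator pre-configured with its natural metric and columns.
{code}
import org.apache.spark.ml.evaluation.{Evaluator, MulticlassClassificationEvaluator, RegressionEvaluator}

// Simplified stand-ins for the real abstractions; trait names are illustrative only.
trait HasDefaultEvaluator {
  def labelCol: String
  def predictionCol: String
  def defaultEvaluator: Evaluator
}

trait ClassifierLike extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new MulticlassClassificationEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol(predictionCol)
}

trait RegressorLike extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new RegressionEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol(predictionCol)
}
{code}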
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: (was: Apache Spark) ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696641#comment-14696641 ] Apache Spark commented on SPARK-9661: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190 ML 1.5 QA: API: Java compatibility, docs Key: SPARK-9661 URL: https://issues.apache.org/jira/browse/SPARK-9661 Project: Spark Issue Type: Sub-task Components: Documentation, Java API, ML, MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Fix For: 1.5.0 Check Java compatibility for MLlib for this release. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) If you find issues, please comment here, or for larger items, create separate JIRAs and link here. Note that we should not break APIs from previous releases. So if you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9961: --- Assignee: Apache Spark ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Apache Spark Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-9661: -- re-open for some clean-ups ML 1.5 QA: API: Java compatibility, docs Key: SPARK-9661 URL: https://issues.apache.org/jira/browse/SPARK-9661 Project: Spark Issue Type: Sub-task Components: Documentation, Java API, ML, MLlib Reporter: Joseph K. Bradley Assignee: Manoj Kumar Fix For: 1.5.0 Check Java compatibility for MLlib for this release. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) If you find issues, please comment here, or for larger items, create separate JIRAs and link here. Note that we should not break APIs from previous releases. So if you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9964) PySpark DataFrameReader accept RDD of String for JSON
Joseph K. Bradley created SPARK-9964: Summary: PySpark DataFrameReader accept RDD of String for JSON Key: SPARK-9964 URL: https://issues.apache.org/jira/browse/SPARK-9964 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Joseph K. Bradley Priority: Minor It would be nice (but not necessary) for the PySpark DataFrameReader to accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path. If this JIRA is accepted, it should probably be duplicated to cover the other input types (not just JSON). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
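For reference, the Scala counterpart that the request asks PySpark to mirror (a minimal sketch, assuming an active {{sc}} and {{sqlContext}}):
{code}
// Scala's DataFrameReader already accepts an RDD[String] of JSON objects
// in addition to a path.
val jsonRDD = sc.parallelize(Seq("""{"a": 1}""", """{"a": 2}"""))
val df = sqlContext.read.json(jsonRDD)
df.show()
{code}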
[jira] [Created] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
Joseph K. Bradley created SPARK-9963: Summary: ML RandomForest cleanup: replace predictNodeIndex with predictImpl Key: SPARK-9963 URL: https://issues.apache.org/jira/browse/SPARK-9963 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9965) Scala, Python SQLContext input methods' deprecation statuses do not match
Joseph K. Bradley created SPARK-9965: Summary: Scala, Python SQLContext input methods' deprecation statuses do not match Key: SPARK-9965 URL: https://issues.apache.org/jira/browse/SPARK-9965 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Joseph K. Bradley Priority: Minor Scala's SQLContext has several methods for data input (jsonFile, jsonRDD, etc.) deprecated. These methods are not deprecated in Python's SQLContext. They should be, but only after Python's DataFrameReader implements analogous methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi edited comment on SPARK-9213 at 8/14/15 6:34 AM: --- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? was (Author: waterman): [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-9968: - Description: When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. was:When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
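A hedged sketch of the locking structure described above, using Guava's RateLimiter instead of BlockGenerator's internals: the potentially long rate-limiter wait happens before the lock is taken, so the thread that swaps buffers and generates blocks only ever waits for a short append.
{code}
import com.google.common.util.concurrent.RateLimiter

// Sketch only, not the actual BlockGenerator fix.
class ThrottledBuffer(ratePerSec: Double) {
  private val limiter = RateLimiter.create(ratePerSec)
  private var buffer = Vector.empty[Any]

  def addData(data: Any): Unit = {
    limiter.acquire()                  // long wait happens outside the lock
    synchronized { buffer :+= data }   // short critical section
  }

  def swapBuffer(): Vector[Any] = synchronized {
    val old = buffer
    buffer = Vector.empty[Any]
    old
  }
}
{code}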
[jira] [Created] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
Tathagata Das created SPARK-9968: Summary: BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9961: - Comment: was deleted (was: User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8190) ML prediction abstractions should have defaultEvaluator fields -- Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
Joseph K. Bradley created SPARK-9961: Summary: ML prediction abstractions should have defaultEvaluator fields Key: SPARK-9961 URL: https://issues.apache.org/jira/browse/SPARK-9961 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Predictor and PredictionModel should have abstract defaultEvaluator methods which return Evaluators. Subclasses like Regressor, Classifier, etc. should all provide natural evaluators, set to use the correct input columns and metrics. Concrete classes may later be modified to The initial implementation should be marked as DeveloperApi since we may need to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
Joseph K. Bradley created SPARK-9962: Summary: Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training Key: SPARK-9962 URL: https://issues.apache.org/jira/browse/SPARK-9962 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training. This applies to both the ML and MLlib implementations, but it is Ok to skip the MLlib implementation since it will eventually be replaced by the ML one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696561#comment-14696561 ] Yadong Qi edited comment on SPARK-9213 at 8/14/15 6:33 AM: --- [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? was (Author: waterman): [~rxin] Yes, I will do it as below: ``` case class Like(left: Expression, right: Expression) { if (flag) { JavaLike(left, right) } else { JoniLike(left, right) } } case class JavaLike(left: Expression, right: Expression) case class JoniLike(left: Expression, right: Expression) ``` Right? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696574#comment-14696574 ] Reynold Xin commented on SPARK-9213: I'm thinking just have Like for Joni, and then LikeJavaFallback for Java. In the analyzer, we replace Like with LikeJavaFallback if the config option sets to java. Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696595#comment-14696595 ] Yadong Qi commented on SPARK-9213: -- [~rxin] There are many places that use Like. Wouldn't we have to check the config option everywhere? Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696988#comment-14696988 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
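A hedged sketch of the rewrite on a toy boolean AST (not Catalyst's Expression classes, and ignoring negation): push ORs below ANDs by distributing Or over And.
{code}
sealed trait Expr
case class Pred(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Naive CNF conversion; can blow up exponentially, which is fine for a sketch.
def toCNF(e: Expr): Expr = e match {
  case And(l, r) => And(toCNF(l), toCNF(r))
  case Or(l, r) =>
    (toCNF(l), toCNF(r)) match {
      case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))
      case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case other => other
}

// (a AND b) OR c  ==>  (a OR c) AND (b OR c)
toCNF(Or(And(Pred("a"), Pred("b")), Pred("c")))
{code}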
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Apache Spark (was: Tathagata Das) BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696987#comment-14696987 ] Apache Spark commented on SPARK-9968: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8204 BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread
[ https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9968: --- Assignee: Tathagata Das (was: Apache Spark) BlockGenerator lock structure can cause lock starvation of the block updating thread Key: SPARK-9968 URL: https://issues.apache.org/jira/browse/SPARK-9968 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC
[ https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697003#comment-14697003 ] Rama Mullapudi commented on SPARK-9734: --- Current 1.5 still gives error when creating table using dataframe write.jdbc Create statement issued by spark looks as below CREATE TABLE foo (TKT_GID DECIMAL(10},0}) NOT NULL) There are closing braces } in the decimal format which causing database to throw error. I looked into the code on github and found in jdbcutils class schemaString function has the extra closing braces } which is causing the issue. /** * Compute the schema string for this RDD. */ def schemaString(df: DataFrame, url: String): String = { . case BooleanType = BIT(1) case StringType = TEXT case BinaryType = BLOB case TimestampType = TIMESTAMP case DateType = DATE case t: DecimalType = sDECIMAL(${t.precision}},${t.scale}}) case _ = throw new IllegalArgumentException(sDon't know how to save $field to JDBC) }) } java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC - Key: SPARK-9734 URL: https://issues.apache.org/jira/browse/SPARK-9734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Greg Rahn Assignee: Davies Liu Fix For: 1.5.0 When using a basic example of reading the EMP table from Redshift via spark-redshift, and writing the data back to Redshift, Spark fails with the below error, related to Numeric/Decimal data types. Redshift table: {code} testdb=# \d emp Table public.emp Column | Type | Modifiers --+---+--- empno| integer | ename| character varying(10) | job | character varying(9) | mgr | integer | hiredate | date | sal | numeric(7,2) | comm | numeric(7,2) | deptno | integer | testdb=# select * from emp; empno | ename |job| mgr | hiredate | sal | comm | deptno ---++---+--++-+-+ 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 |NULL | 20 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 |NULL | 10 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 |NULL | 20 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 |NULL | 20 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 |NULL | 30 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 |NULL | 20 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 |0.00 | 30 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 |NULL | 30 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10 (14 rows) {code} Spark Code: {code} val url = jdbc:redshift://rshost:5439/testdb?user=xxxpassword=xxx val driver = com.amazon.redshift.jdbc41.Driver val t = sqlContext.read.format(com.databricks.spark.redshift).option(jdbcdriver, driver).option(url, url).option(dbtable, emp).option(tempdir, s3n://spark-temp-dir).load() t.registerTempTable(SparkTempTable) val t1 = sqlContext.sql(select * from SparkTempTable) t1.write.format(com.databricks.spark.redshift).option(driver, driver).option(url, url).option(dbtable, t1).option(tempdir, s3n://spark-temp-dir).option(avrocompression, snappy).mode(error).save() {code} Error Stack: {code} java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC at 
org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132) at
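For reference, a minimal standalone sketch (not the actual Spark patch) of how the stray braces in the interpolated string end up in the generated DDL, and what the corrected interpolation looks like; DecimalType below is a local stand-in, not org.apache.spark.sql.types.DecimalType:
{code}
// Stand-in type for illustration only.
case class DecimalType(precision: Int, scale: Int)

object DecimalDdlSketch {
  // Buggy form reported above: an extra '}' follows each interpolated expression.
  def buggy(t: DecimalType): String = s"DECIMAL(${t.precision}},${t.scale}})"
  // Corrected form: one closing brace per interpolation.
  def fixed(t: DecimalType): String = s"DECIMAL(${t.precision},${t.scale})"

  def main(args: Array[String]): Unit = {
    val t = DecimalType(10, 0)
    println(buggy(t)) // DECIMAL(10},0})  -> rejected by the database
    println(fixed(t)) // DECIMAL(10,0)    -> valid DDL
  }
}
{code}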
[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer
Kai Sasaki created SPARK-9977: - Summary: The usage of a label generated by StringIndexer Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial By using {{StringIndexer}}, we can obtain an indexed label in a new column. A following estimator in the pipeline should then use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
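For illustration, a minimal sketch of the kind of snippet the ML guide could include — the column names ("category", "categoryIndex", "features") and the LogisticRegression stage are illustrative assumptions, not taken from the issue:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer writes the indexed label into a new column ("categoryIndex" here);
// a downstream estimator has to be pointed at that column explicitly.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val lr = new LogisticRegression()
  .setLabelCol("categoryIndex") // use the indexed label, not the original string column
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, lr))
// val model = pipeline.fit(trainingDf) // trainingDf is assumed to contain "category" and "features" columns
{code}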
[jira] [Updated] (SPARK-9976) create function does not work
[ https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-9976: - Description: I use beeline to connect to the ThriftServer, but add jar does not work, so I use create function instead; see the link below. http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html I do as below: create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; It returns OK, and when I connect to the metastore I see records in the FUNCS table. select gdecodeorder(t1) from tableX limit 1; It returns the error 'Couldn't find function default.gdecodeorder'. This is the exception: 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: Couldn't find function default.gdecodeorder 15/08/14 15:04:47 ERROR RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function default.t_gdecodeorder does not exist)) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740) at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy21.get_function(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721) at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy22.getFunction(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54) at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44) at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
[jira] [Created] (SPARK-9978) Window functions require partitionBy to work as expected
Maciej Szymkiewicz created SPARK-9978: - Summary: Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce the following SQL query: {code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} using the PySpark API. Unfortunately it doesn't work as expected: {code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+---+
|   x| rn|
+----+---+
| 0.5|  1|
|0.25|  1|
|0.75|  1|
+----+---+
{code} As a workaround it is possible to call partitionBy without additional arguments: {code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect the Scala API: {code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
+----+---+
|   x| rn|
+----+---+
|0.25|  1|
| 0.5|  2|
|0.75|  3|
+----+---+
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-9978: -- Description: I am trying to reproduce following SQL query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} was: I am trying to reproduce following query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce following SQL query: {code} df.registerTempTable(df) sqlContext.sql(SELECT x, row_number() OVER (ORDER BY x) as rn FROM df).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} using PySpark API. 
Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{x: 0.25}, {x: 0.5}, {x: 0.75}]) df.select(df[x], rowNumber().over(Window.orderBy(x)).alias(rn)).show() ++--+ | x|rn| ++--+ | 0.5| 1| |0.25| 1| |0.75| 1| ++--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df[x], rowNumber().over(Window.partitionBy().orderBy(x)).alias(rn) ).show() ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($x, rowNumber().over(Window.orderBy($x)).alias(rn)).show ++--+ | x|rn| ++--+ |0.25| 1| | 0.5| 2| |0.75| 3| ++--+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696677#comment-14696677 ] ding commented on SPARK-5556: - The code can be found at https://github.com/intel-analytics/TopicModeling. There is an example in the package; you can try Gibbs sampling LDA or online LDA by setting --optimizer to gibbs or online. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9725) spark sql query string field returns empty/garbled string
[ https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696684#comment-14696684 ] Zhongshuai Pei commented on SPARK-9725: --- In my environment, offset did not change along with BYTE_ARRAY_OFFSET. When spark.executor.memory=36g, BYTE_ARRAY_OFFSET is 24 and offset is 16, but when spark.executor.memory=30g, BYTE_ARRAY_OFFSET is 16 and offset is 16 too. So if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[] && ((byte[]) base).length == numBytes) will get different results. [~rxin] spark sql query string field returns empty/garbled string Key: SPARK-9725 URL: https://issues.apache.org/jira/browse/SPARK-9725 Project: Spark Issue Type: Bug Components: SQL Reporter: Fei Wang Assignee: Davies Liu Priority: Blocker To reproduce it: 1. deploy spark in cluster mode (I use standalone mode locally) 2. set executor memory = 32g; set the following config in spark-default.xml: spark.executor.memory 36g 3. run spark-sql.sh with show tables; it returns empty/garbled strings -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
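As a side note, the 24-vs-16 difference reported above is consistent with compressed oops being disabled once the executor heap exceeds roughly 32 GB, which changes the JVM's byte[] base offset. A small hedged sketch (plain JVM code, not Spark's internal Platform class) to check the value on a given JVM:
{code}
import java.lang.reflect.Field

// Prints the byte[] base offset that constants like BYTE_ARRAY_OFFSET are derived from.
// Typically 16 with compressed oops (heaps up to ~32 GB) and 24 without (e.g. -Xmx36g),
// matching the 16/24 values observed in the comment above.
object ByteArrayOffsetCheck {
  def main(args: Array[String]): Unit = {
    val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    val unsafe = f.get(null).asInstanceOf[sun.misc.Unsafe]
    println(unsafe.arrayBaseOffset(classOf[Array[Byte]]))
  }
}
{code}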
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): print Joining tables %s %s % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) ||
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Cheng Lian (was: Apache Spark) JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
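The "different boolean" presumably refers to the overwrite flag on Hadoop's FileSystem.create. A hedged sketch of the two behaviors (the path is made up; this is not the Spark writer code itself):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// create(path, true) silently replaces a file left behind by a failed or speculative
// attempt; create(path, false) throws "File already exists" instead.
val fs = FileSystem.get(new Configuration())
val path = new Path("/tmp/spark-9899-demo/part-00000")

fs.create(path, true).close()                 // first attempt (or overwrite): succeeds
println(Try(fs.create(path, false).close()))  // retry without overwrite: Failure(... already exists)
println(Try(fs.create(path, true).close()))   // retry with overwrite: Success(())
{code}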
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696672#comment-14696672 ] Apache Spark commented on SPARK-9899: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8191 JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer
[ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9899: --- Assignee: Apache Spark (was: Cheng Lian) JSON/Parquet writing on retry or speculation broken with direct output committer Key: SPARK-9899 URL: https://issues.apache.org/jira/browse/SPARK-9899 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker If the first task fails all subsequent tasks will. We probably need to set a different boolean when calling create. {code} java.io.IOException: File already exists: ... ... at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.init(JSONRelation.scala:185) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
Maciej Bryński created SPARK-9970: - Summary: SQLContext.createDataFrame failed to properly determine column names Key: SPARK-9970 URL: https://issues.apache.org/jira/browse/SPARK-9970 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Maciej Bryński Priority: Minor Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): print Joining tables %s %s % (namestr(tableLeft), namestr(tableRight)) sys.stdout.flush() tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) I tried to check if groupBy isn't the root of the problem but it's looks right grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the cause of the problem but it looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable =
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696706#comment-14696706 ] Yu Ishikawa commented on SPARK-9871: I think it would be better to deal with {{struct}} in another issue, since {{collect}} doesn't work with a DataFrame which has a column of struct type; we need to improve the {{collect}} method first. When I tried to implement the {{struct}} function, a struct column converted by {{dfToCols}} consists of {{jobj}} values: {noformat} List of 1 $ structed:List of 2 ..$ :Class 'jobj' <environment: 0x7fd46efe4e68> ..$ :Class 'jobj' <environment: 0x7fd46efee078> {noformat} Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
Frank Rosner created SPARK-9971: --- Summary: MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, and so will {{x < Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.functions.max import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9972) Add `struct` function in SparkR
Yu Ishikawa created SPARK-9972: -- Summary: Add `struct` function in SparkR Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support {{struct}} function on a DataFrame in SparkR. However, I think we need to improve {{collect}} function in SparkR in order to implement {{struct}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9973) wrong buffer size
xukun created SPARK-9973: Summary: wrong buffer size Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696717#comment-14696717 ] Sean Owen commented on SPARK-9973: -- [~xukun] This would be more useful if you added a descriptive title and a more complete description. What code is affected? What buffer? You explained this in the PR to some degree, so a brief note about the problem to be solved is appropriate. Right now someone reading this doesn't know what it refers to. wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The allocated buffer size is wrong (4 + size * columnType.defaultSize * columnType.defaultSize); the right buffer size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
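To make the description concrete, a small worked example (the column type and row count are made up): for an INT column with defaultSize = 4 and size = 1000 values, the two formulas give very different allocations:
{code}
val size = 1000
val defaultSize = 4                               // e.g. an INT column
val wrong = 4 + size * defaultSize * defaultSize  // 16004 bytes: defaultSize applied twice
val right = 4 + size * defaultSize                //  4004 bytes: 4-byte header + one slot per value
println((wrong, right))                           // (16004,4004)
{code}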
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696716#comment-14696716 ] Sean Owen commented on SPARK-9971: -- My instinct is that this should in fact result in NaN; NaN is not in general ignored. For example {{math.max(1.0, Double.NaN)}} and {{math.min(1.0, Double.NaN)}} are both NaN. But are you ready for some weird? {code} scala Seq(1.0, Double.NaN).max res23: Double = NaN scala Seq(Double.NaN, 1.0).max res24: Double = 1.0 scala Seq(5.0, Double.NaN, 1.0).max res25: Double = 1.0 scala Seq(5.0, Double.NaN, 1.0, 6.0).max res26: Double = 6.0 scala Seq(5.0, Double.NaN, 1.0, 6.0, Double.NaN).max res27: Double = NaN {code} Not sure what to make of that, other than the Scala collection library isn't a good reference for behavior. Java? {code} scala java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(1.0), new java.lang.Double(Double.NaN))) res33: Double = NaN scala java.util.Collections.max(java.util.Arrays.asList(new java.lang.Double(Double.NaN), new java.lang.Double(1.0))) res34: Double = NaN {code} Makes more sense at least. I think this is correct behavior and you would filter NaN if you want to ignore them, as it is generally not something the language ignores. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696720#comment-14696720 ] Apache Spark commented on SPARK-9871: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/8194 Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: (was: Apache Spark) Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9871: --- Assignee: Apache Spark Add expression functions into SparkR which have a variable parameter Key: SPARK-9871 URL: https://issues.apache.org/jira/browse/SPARK-9871 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark Add expression functions into SparkR which has a variable parameter, like {{concat}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
Saisai Shao created SPARK-9969: -- Summary: Remove old Yarn MR classpath api support for Spark Yarn client Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696681#comment-14696681 ] Apache Spark commented on SPARK-9969: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8192 Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: Apache Spark Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Assignee: Apache Spark Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client
[ https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9969: --- Assignee: (was: Apache Spark) Remove old Yarn MR classpath api support for Spark Yarn client -- Key: SPARK-9969 URL: https://issues.apache.org/jira/browse/SPARK-9969 Project: Spark Issue Type: Bug Components: YARN Reporter: Saisai Shao Priority: Minor Since now we only support Yarn stable API, so here propose to remove old MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH api support in 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names
[ https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-9970: -- Description: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 'id', 'orderid', 'product' {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||||-- element: struct (containsNull = true) |||||-- _1: long (nullable = true) |||||-- _2: long (nullable = true) |||||-- _3: string (nullable = true) {code} I tried to check if groupBy isn't the root of the problem but it's looks right {code:java} grouped = orders.rdd.groupBy(lambda r: r.userid) print grouped.map(lambda x: list(x[1])).collect() [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]] {code} So I assume that the problem is with createDataFrame. was: Hi, I'm trying to do nested join of tables. After first join everything is ok, but second join made some of the column names lost. 
My code is following: {code:java} def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = left_outer): tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight])) tmpTable = tmpTable.select(tmpTable._1.alias(joinColumn), tmpTable._2.data.alias(columnNested)) return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable[joinColumn], joinType).drop(joinColumn) user = sqlContext.read.json(path + user.json) user.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) order = sqlContext.read.json(path + order.json) order.printSchema(); root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) lines = sqlContext.read.json(path + lines.json) lines.printSchema(); root |-- id: long (nullable = true) |-- orderid: long (nullable = true) |-- product: string (nullable = true) {code} And joining code: {code:java} orders = joinTable(order, lines, id, orderid, lines) orders.printSchema() root |-- id: long (nullable = true) |-- price: double (nullable = true) |-- userid: long (nullable = true) |-- lines: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- orderid: long (nullable = true) |||-- product: string (nullable = true) clients = joinTable(user, orders, id, userid, orders) clients.printSchema() root |-- id: long (nullable = true) |-- name: string (nullable = true) |-- orders: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: long (nullable = true) |||-- price: double (nullable = true) |||-- userid: long (nullable = true) |||-- lines: array (nullable = true) ||
[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696700#comment-14696700 ] Apache Spark commented on SPARK-6624: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8193 Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
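For intuition, a toy sketch of the distribution step behind CNF conversion (this is not Catalyst code; it only handles And/Or over opaque leaves and ignores Not): a filter like (a && b) || c becomes (a || c) && (b || c), and each top-level conjunct can then be pushed to the data source on its own.
{code}
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Recursively distribute Or over And until no Or has an And beneath it.
def toCnf(e: Expr): Expr = e match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) =>
    (toCnf(l), toCnf(r)) match {
      case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
      case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case leaf => leaf
}

// (a && b) || c  ==>  (a || c) && (b || c)
println(toCnf(Or(And(Leaf("a"), Leaf("b")), Leaf("c"))))
// And(Or(Leaf(a),Leaf(c)),Or(Leaf(b),Leaf(c)))
{code}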
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: (was: Apache Spark) Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources
[ https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6624: --- Assignee: Apache Spark Convert filters into CNF for data sources - Key: SPARK-6624 URL: https://issues.apache.org/jira/browse/SPARK-6624 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs. Note that we already try to do some of these in BooleanSimplification, but I think we should just formalize it to use CNF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Priority: Minor (was: Major) MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-9971: Description: h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. was: h5. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h5. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h5. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x Double.NaN}} will always lead false for all {{x: Double}}, so will {{x Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField(col, DoubleType, false) ))) dataFrame.select(max(col)).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun updated SPARK-9973: - Description: The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
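A minimal sketch of the allocation fix described in this ticket; the method and parameter names below are illustrative only (taken from the wording of the report), not Spark's actual columnar-buffer code.
{code}
import java.nio.ByteBuffer

// Hypothetical columnar-buffer allocation illustrating the reported fix.
// `size` is the expected value count and `defaultSize` the per-value width in bytes.
def allocate(size: Int, defaultSize: Int): ByteBuffer = {
  // Wrong: multiplying the per-value width in twice over-allocates quadratically.
  // ByteBuffer.allocate(4 + size * defaultSize * defaultSize)

  // Right: a 4-byte header plus one defaultSize-wide slot per value.
  ByteBuffer.allocate(4 + size * defaultSize)
}
{code}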
[jira] [Commented] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696714#comment-14696714 ] Apache Spark commented on SPARK-9973: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/8189 wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: (was: Apache Spark) wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9973: --- Assignee: Apache Spark wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun Assignee: Apache Spark The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * columnType.defaultSize); the correct size is 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696760#comment-14696760 ] Frank Rosner commented on SPARK-9971: - I would like to provide a patch to make the following unit tests in {{DataFrameFunctionsSuite}} succeed: {code} test("max function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(max("col1")), Seq(Row(10d)) ) checkAnswer( df.select(max("col2")), Seq(Row(Double.NaN)) ) } test("min function ignoring Double.NaN") { val df = Seq( (Double.NaN, Double.NaN), (-10d, Double.NaN), (10d, Double.NaN) ).toDF("col1", "col2") checkAnswer( df.select(min("col1")), Seq(Row(-10d)) ) checkAnswer( df.select(min("col2")), Seq(Row(Double.NaN)) ) } {code} MaxFunction not working correctly with columns containing Double.NaN Key: SPARK-9971 URL: https://issues.apache.org/jira/browse/SPARK-9971 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Frank Rosner Priority: Minor h4. Problem Description When using the {{max}} function on a {{DoubleType}} column that contains {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. This is because it compares all values with the running maximum. However, {{x > Double.NaN}} will always be false for all {{x: Double}}, as will {{x < Double.NaN}}. h4. How to Reproduce {code} import org.apache.spark.sql.{SQLContext, Row} import org.apache.spark.sql.types._ import org.apache.spark.sql.functions.max val sql = new SQLContext(sc) val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) val dataFrame = sql.createDataFrame(rdd, StructType(List( StructField("col", DoubleType, false) ))) dataFrame.select(max("col")).first // returns org.apache.spark.sql.Row = [NaN] {code} h4. Solution The {{max}} and {{min}} functions should ignore NaN values, as they are not numbers. If a column contains only NaN values, then the maximum and minimum are not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
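To make the intent of these tests concrete, here is an illustrative NaN-ignoring accumulator; the names are hypothetical and this is not Spark's internal MaxFunction, only a sketch of the update rule the tests expect.
{code}
// Illustrative accumulator: NaN inputs are skipped; NaN is reported only when
// no non-NaN value was ever seen (matching the expected results above).
final class NaNIgnoringMax {
  private var seen = false
  private var runningMax = Double.NaN

  def update(value: Double): Unit = {
    if (!value.isNaN && (!seen || value > runningMax)) {
      runningMax = value
      seen = true
    }
  }

  def result: Double = if (seen) runningMax else Double.NaN
}
{code}
Feeding it col1's values (NaN, -10, 10) yields 10.0, while col2's all-NaN values yield NaN, matching the checkAnswer expectations above.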
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697098#comment-14697098 ] Cody Koeninger commented on SPARK-9947: --- You already have access to offsets and can save them however you want. You can provide those offsets on restart, regardless of whether checkpointing was enabled. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
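For context, the offset handling Cody refers to is available on the direct stream itself. The sketch below assumes the Spark 1.3+ Kafka direct API; saveOffsets and loadOffsets are hypothetical placeholders for whatever external store the application chooses.
{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Hypothetical helpers: persist offsets to any external store (HDFS, a database, ZK, ...).
def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = ???
def loadOffsets(): Map[TopicAndPartition, Long] = ???

def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
  val fromOffsets = loadOffsets() // offsets recorded by a previous run
  val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    // Offsets are exposed on every batch, whether or not checkpointing is enabled.
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd, then record where this batch ended ...
    saveOffsets(ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
  }
  stream
}
{code}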
[jira] [Created] (SPARK-9979) Unable to request addition to powered by spark
Jeff Palmucci created SPARK-9979: Summary: Unable to request addition to powered by spark Key: SPARK-9979 URL: https://issues.apache.org/jira/browse/SPARK-9979 Project: Spark Issue Type: Bug Components: Documentation Reporter: Jeff Palmucci Priority: Minor The Powered By Spark page asks users to submit new listing requests to u...@apache.spark.org. However, when I do that, I get: {quote} Hi. This is the qmail-send program at apache.org. I'm afraid I wasn't able to deliver your message to the following addresses. This is a permanent error; I've given up. Sorry it didn't work out. u...@spark.apache.org: Must be sent from an @apache.org address or a subscriber address or an address in LDAP. {quote} The project I wanted to list is here: http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697157#comment-14697157 ] Dan Dutrow commented on SPARK-9947: --- The desire is to continue using checkpointing for everything but allow selective deleting of certain types of checkpoint data. Using an external database to duplicate that functionality is not desired. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697166#comment-14697166 ] Cody Koeninger commented on SPARK-9947: --- You can't re-use checkpoint data across application upgrades anyway, so that honestly seems kind of pointless. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: Apache Spark The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Assignee: Apache Spark Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9977: --- Assignee: (was: Apache Spark) The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9977) The usage of a label generated by StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697104#comment-14697104 ] Apache Spark commented on SPARK-9977: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/8205 The usage of a label generated by StringIndexer --- Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial Labels: documentation By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a subsequent estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697180#comment-14697180 ] Dan Dutrow commented on SPARK-9947: --- I want to maintain the data in updateStateByKey between upgrades. This data can be recovered between upgrades so long as objects don't change. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697193#comment-14697193 ] Cody Koeninger commented on SPARK-9947: --- Didn't you already say that you were saving updateStateByKey state yourself? Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697203#comment-14697203 ] Cheng Lian commented on SPARK-8118: --- Unfortunately no. Turn off noisy log output produced by Parquet 1.7.0 --- Key: SPARK-8118 URL: https://issues.apache.org/jira/browse/SPARK-8118 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 Parquet 1.7.0 renames its package to org.apache.parquet, so {{ParquetRelation.enableLogForwarding}} needs to be adjusted accordingly to avoid noisy log output. A better approach than simply muting these log lines is to redirect Parquet logs via SLF4J, so that we can handle them consistently. In general these logs are very useful, especially when diagnosing Parquet memory issues and filter push-down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6249) Get Kafka offsets from consumer group in ZK when using direct stream
[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697188#comment-14697188 ] Cody Koeninger commented on SPARK-6249: --- If you want an API that has imprecise semantics and stores offsets in ZK, use the receiver-based stream. This ticket has been closed for a while; I'd suggest further discussion would be better suited for the mailing list. Get Kafka offsets from consumer group in ZK when using direct stream Key: SPARK-6249 URL: https://issues.apache.org/jira/browse/SPARK-6249 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das This is the proposal. The simpler direct API (the one that does not take explicit offsets) can be modified to also pick up the initial offset from ZK if group.id is specified. This is exactly analogous to how we find the latest or earliest offset in that API, except that instead of the latest/earliest offset of the topic we want to find the offset for the consumer group. The group offsets in ZK are not used at all for any further processing and restarting, so the exactly-once semantics are not broken. The use case where this is useful is simplified code upgrade. If the user wants to upgrade the code, he/she can stop the context gracefully, which will ensure the ZK consumer group offset is updated with the last offsets processed. Then the new code, started fresh (not restarted from checkpoint), can pick up the consumer group offset from ZK and continue where the previous code left off. Without the functionality of picking up consumer group offsets to start from (that is, currently), the only way to do this is for users to save the offsets somewhere (file, database, etc.) and manage the offsets themselves. I just want to simplify this process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data
[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697220#comment-14697220 ] Dan Dutrow commented on SPARK-9947: --- Ok, right, we're saving the same state that's outputted from the updateStateByKey to HDFS. The thought is that maybe updateStateByKey is saving the exact same data in the checkpoint as I have to do in my own function. Allowing separation of the different types of data stored in the checkpoint data might allow me to not have to save the same state data again. Separate Metadata and State Checkpoint Data --- Key: SPARK-9947 URL: https://issues.apache.org/jira/browse/SPARK-9947 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Dan Dutrow Original Estimate: 168h Remaining Estimate: 168h Problem: When updating an application that has checkpointing enabled to support the updateStateByKey and 24/7 operation functionality, you encounter the problem where you might like to maintain state data between restarts but delete the metadata containing execution state. If checkpoint data exists between code redeployment, the program may not execute properly or at all. My current workaround for this issue is to wrap updateStateByKey with my own function that persists the state after every update to my own separate directory. (That allows me to delete the checkpoint with its metadata before redeploying) Then, when I restart the application, I initialize the state with this persisted data. This incurs additional overhead due to persisting of the same data twice: once in the checkpoint and once in my persisted data folder. If Kafka Direct API offsets could be stored in another separate checkpoint directory, that would help address the problem of having to blow that away between code redeployment as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
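One way to express the workaround described in this thread is the {{updateStateByKey}} overload that accepts an initial state RDD (available since Spark 1.3). The sketch below is illustrative only; the state path and the latestSnapshotPath helper are assumptions, not Spark APIs.
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Assumption: state snapshots are written to a durable location outside the checkpoint dir.
val statePath = "hdfs:///app/state"

// Hypothetical helper: locate the most recent snapshot written by a previous run.
def latestSnapshotPath(base: String): String = ???

def withRecoveredState(ssc: StreamingContext, events: DStream[(String, Long)]): DStream[(String, Long)] = {
  // Example state: a running count per key.
  val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
    (values, state) => Some(state.getOrElse(0L) + values.sum)

  // Seed the state from the snapshot left by the previous deployment.
  val initialState = ssc.sparkContext.objectFile[(String, Long)](latestSnapshotPath(statePath))

  val stateStream = events.updateStateByKey(
    updateFunc,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialState)

  // Persist the latest state in addition to the checkpoint, one directory per batch,
  // so it survives deleting the checkpoint directory on upgrade.
  stateStream.foreachRDD((rdd, time) => rdd.saveAsObjectFile(s"$statePath/${time.milliseconds}"))
  stateStream
}
{code}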
[jira] [Updated] (SPARK-9972) Add `struct` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-9972: --- Target Version/s: 1.6.0 (was: 1.5.0) Add `struct` function in SparkR --- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support the {{struct}} function on a DataFrame in SparkR. However, I think we need to improve the {{collect}} function in SparkR in order to implement {{struct}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9972) Add `struct` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697232#comment-14697232 ] Shivaram Venkataraman commented on SPARK-9972: -- Yeah, this can be marked as being blocked by https://issues.apache.org/jira/browse/SPARK-6819 Add `struct` function in SparkR --- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support the {{struct}} function on a DataFrame in SparkR. However, I think we need to improve the {{collect}} function in SparkR in order to implement {{struct}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9946) NPE in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9946. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8177 [https://github.com/apache/spark/pull/8177] NPE in TaskMemoryManager Key: SPARK-9946 URL: https://issues.apache.org/jira/browse/SPARK-9946 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.5.0 {code} Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 6 in stage 6801.0 failed 1 times, most recent failure: Lost task 6.0 in stage 6801.0 (TID 41123, localhost): java.lang.NullPointerException [info]at org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) [info]at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:165) [info]at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:142) [info]at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:129) [info]at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) [info]at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:302) [info]at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:218) [info]at org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:110) [info]at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) [info]at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) [info]at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) [info]at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99) [info]at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) [info]at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) [info]at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info]at org.apache.spark.scheduler.Task.run(Task.scala:88) [info]at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 13:27:17.435 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 23.0 in stage 6801.0 (TID 41140, localhost): TaskKilled (killed intentionally) [info]at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info]at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 13:27:17.436 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.0 in stage 6801.0 (TID 41129, localhost): TaskKilled (killed intentionally) [info]at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9828) Should not share `{}` among instances
[ https://issues.apache.org/jira/browse/SPARK-9828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9828. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8110 [https://github.com/apache/spark/pull/8110] Should not share `{}` among instances - Key: SPARK-9828 URL: https://issues.apache.org/jira/browse/SPARK-9828 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Fix For: 1.5.0 We use `{}` as the initial value in some places, e.g., https://github.com/apache/spark/blob/master/python/pyspark/ml/param/__init__.py#L64. This makes instances share the same param map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9984) Create local physical operator interface
Reynold Xin created SPARK-9984: -- Summary: Create local physical operator interface Key: SPARK-9984 URL: https://issues.apache.org/jira/browse/SPARK-9984 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin The local operator interface should be similar to traditional database iterators, with open/close. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
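Since the ticket only states the requirement, the trait below is a hypothetical sketch of an open/next/close iterator-style local operator in the spirit of the classic Volcano model; it is not Spark's actual interface.
{code}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical iterator-style local operator in the open/next/close tradition.
trait LocalNode {
  /** Acquire resources and prepare any child operators. */
  def open(): Unit

  /** Advance to the next row; returns false once the input is exhausted. */
  def next(): Boolean

  /** The row produced by the most recent successful call to next(). */
  def fetch(): InternalRow

  /** Release all resources. */
  def close(): Unit
}
{code}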
[jira] [Updated] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9985: Assignee: Shixiong Zhu DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall: every time I try to use the PostgreSQL JDBC driver I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
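A commonly suggested workaround, sketched below under the assumption that the PostgreSQL driver jar is already on the driver and executor classpaths: register the driver class explicitly and pass connection properties directly to {{jdbc}}. Whether this is sufficient depends on how the jar is distributed; the URL, table name, and credentials are placeholders.
{code}
import java.util.Properties
import org.apache.spark.sql.DataFrame

def writeToPostgres(df: DataFrame): Unit = {
  // Register the driver with java.sql.DriverManager up front instead of relying
  // on options set on the reader/writer being propagated.
  Class.forName("org.postgresql.Driver")

  val props = new Properties()
  props.setProperty("user", "spark")       // placeholder credentials
  props.setProperty("password", "secret")
  props.setProperty("driver", "org.postgresql.Driver")

  // Placeholder URL and table name.
  df.write.jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)
}
{code}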
[jira] [Resolved] (SPARK-9828) Should not share `{}` among instances
[ https://issues.apache.org/jira/browse/SPARK-9828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9828. -- Resolution: Fixed Fix Version/s: 1.4.2 Should not share `{}` among instances - Key: SPARK-9828 URL: https://issues.apache.org/jira/browse/SPARK-9828 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Fix For: 1.4.2, 1.5.0 We use `{}` as the initial value in some places, e.g., https://github.com/apache/spark/blob/master/python/pyspark/ml/param/__init__.py#L64. This makes instances share the same param map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9986) Create a simple test framework for local operators
Reynold Xin created SPARK-9986: -- Summary: Create a simple test framework for local operators Key: SPARK-9986 URL: https://issues.apache.org/jira/browse/SPARK-9986 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin It'd be great if we could just create local query plans and test the correctness of their implementation directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9994) Create local TopK operator
Reynold Xin created SPARK-9994: -- Summary: Create local TopK operator Key: SPARK-9994 URL: https://issues.apache.org/jira/browse/SPARK-9994 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Similar to the existing TakeOrderedAndProject, except in a single thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
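As a point of reference, the single-threaded behavior described here can be sketched with a bounded priority queue; the code below is illustrative only and does not mirror TakeOrderedAndProject's implementation.
{code}
import scala.collection.mutable

// Illustrative single-threaded top-K: keep at most k rows in a heap that
// evicts the smallest of the currently retained rows first.
def topK[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
  require(k > 0, "k must be positive")
  val heap = mutable.PriorityQueue.empty[T](ord.reverse) // reversed ordering => min-heap
  input.foreach { row =>
    if (heap.size < k) {
      heap.enqueue(row)
    } else if (ord.gt(row, heap.head)) { // row beats the smallest retained element
      heap.dequeue()
      heap.enqueue(row)
    }
  }
  heap.dequeueAll.reverse // ascending under ord -> reversed for largest-first output
}
{code}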
[jira] [Created] (SPARK-9993) Create local union operator
Reynold Xin created SPARK-9993: -- Summary: Create local union operator Key: SPARK-9993 URL: https://issues.apache.org/jira/browse/SPARK-9993 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8
[ https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9589: --- Assignee: Apache Spark (was: Josh Rosen) Flaky test: HiveCompatibilitySuite.groupby8 --- Key: SPARK-9589 URL: https://issues.apache.org/jira/browse/SPARK-9589 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/ {code} sbt.ForkMain$ForkError: Failed to execute query using catalyst: Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, localhost): java.lang.NullPointerException at org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) at org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366) at org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600) at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.init(UnsafeHybridAggregationIterator.scala:290) at org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358) at org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130) at org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9561) Enable BroadcastJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9561: --- Assignee: Andrew Or (was: Apache Spark) Enable BroadcastJoinSuite - Key: SPARK-9561 URL: https://issues.apache.org/jira/browse/SPARK-9561 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is introduced in SPARK-8735, but due to complexities with TestSQLContext this needs to be commented out completely. We need to find a way to make it work before releasing 1.5. For more detail, see the discussion: https://github.com/apache/spark/pull/7770 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9561) Enable BroadcastJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697425#comment-14697425 ] Apache Spark commented on SPARK-9561: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8208 Enable BroadcastJoinSuite - Key: SPARK-9561 URL: https://issues.apache.org/jira/browse/SPARK-9561 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is introduced in SPARK-8735, but due to complexities with TestSQLContext this needs to be commented out completely. We need to find a way to make it work before releasing 1.5. For more detail, see the discussion: https://github.com/apache/spark/pull/7770 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org