[jira] [Created] (SPARK-18484) case class datasets - ability to specify decimal precision and scale
Damian Momot created SPARK-18484: Summary: case class datasets - ability to specify decimal precision and scale Key: SPARK-18484 URL: https://issues.apache.org/jira/browse/SPARK-18484 Project: Spark Issue Type: Improvement Affects Versions: 2.0.1, 2.0.0 Reporter: Damian Momot Currently when using decimal type (BigDecimal in scala case class) there's no way to enforce precision and scale. This is quite critical when saving data - regarding space usage and compatibility with external systems (for example Hive table) because spark saves data as Decimal(38,18) {code:scala} val spark: SparkSession = ??? case class TestClass(id: String, money: BigDecimal) val testDs = spark.createDataset(Seq( TestClass("1", BigDecimal("22.50")), TestClass("2", BigDecimal("500.66")) )) testDs.printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(38,18) (nullable = true) {code} Workaround is to convert dataset to dataframe before saving and manually cast to specific decimal scale/precision: {code:scala} import org.apache.spark.sql.types.DecimalType val testDf = testDs.toDF() testDf .withColumn("money", testDf("money").cast(DecimalType(10,2))) .printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(10,2) (nullable = true) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18484) case class datasets - ability to specify decimal precision and scale
[ https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damian Momot updated SPARK-18484: - Description: Currently when using decimal type (BigDecimal in scala case class) there's no way to enforce precision and scale. This is quite critical when saving data - regarding space usage and compatibility with external systems (for example Hive table) because spark saves data as Decimal(38,18) {code} case class TestClass(id: String, money: BigDecimal) val testDs = spark.createDataset(Seq( TestClass("1", BigDecimal("22.50")), TestClass("2", BigDecimal("500.66")) )) testDs.printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(38,18) (nullable = true) {code} Workaround is to convert dataset to dataframe before saving and manually cast to specific decimal scale/precision: {code} import org.apache.spark.sql.types.DecimalType val testDf = testDs.toDF() testDf .withColumn("money", testDf("money").cast(DecimalType(10,2))) .printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(10,2) (nullable = true) {code} was: Currently when using decimal type (BigDecimal in scala case class) there's no way to enforce precision and scale. This is quite critical when saving data - regarding space usage and compatibility with external systems (for example Hive table) because spark saves data as Decimal(38,18) {code:scala} val spark: SparkSession = ??? 
case class TestClass(id: String, money: BigDecimal) val testDs = spark.createDataset(Seq( TestClass("1", BigDecimal("22.50")), TestClass("2", BigDecimal("500.66")) )) testDs.printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(38,18) (nullable = true) {code} Workaround is to convert dataset to dataframe before saving and manually cast to specific decimal scale/precision: {code:scala} import org.apache.spark.sql.types.DecimalType val testDf = testDs.toDF() testDf .withColumn("money", testDf("money").cast(DecimalType(10,2))) .printSchema() {code} {code} root |-- id: string (nullable = true) |-- money: decimal(10,2) (nullable = true) {code} > case class datasets - ability to specify decimal precision and scale > > > Key: SPARK-18484 > URL: https://issues.apache.org/jira/browse/SPARK-18484 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.0.1 >Reporter: Damian Momot > > Currently when using decimal type (BigDecimal in scala case class) there's no > way to enforce precision and scale. 
This is quite critical when saving data - > regarding space usage and compatibility with external systems (for example > Hive table) because spark saves data as Decimal(38,18) > {code} > case class TestClass(id: String, money: BigDecimal) > val testDs = spark.createDataset(Seq( > TestClass("1", BigDecimal("22.50")), > TestClass("2", BigDecimal("500.66")) > )) > testDs.printSchema() > {code} > {code} > root > |-- id: string (nullable = true) > |-- money: decimal(38,18) (nullable = true) > {code} > Workaround is to convert dataset to dataframe before saving and manually cast > to specific decimal scale/precision: > {code} > import org.apache.spark.sql.types.DecimalType > val testDf = testDs.toDF() > testDf > .withColumn("money", testDf("money").cast(DecimalType(10,2))) > .printSchema() > {code} > {code} > root > |-- id: string (nullable = true) > |-- money: decimal(10,2) (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
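The workaround above stops at an untyped DataFrame; if a typed Dataset is still needed after narrowing the decimal, the cast result can be re-attached with {{.as[T]}}. A minimal sketch, assuming a running SparkSession ({{narrowMoney}} is an illustrative helper name, not Spark API):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

case class TestClass(id: String, money: BigDecimal)

// Illustrative only: cast on the untyped DataFrame, then re-attach the type.
def narrowMoney(spark: SparkSession): Unit = {
  import spark.implicits._
  val testDs = spark.createDataset(Seq(
    TestClass("1", BigDecimal("22.50")),
    TestClass("2", BigDecimal("500.66"))
  ))
  val narrowed = testDs.toDF()
    .withColumn("money", $"money".cast(DecimalType(10, 2)))
  narrowed.printSchema()                  // money: decimal(10,2)
  val typedAgain = narrowed.as[TestClass] // typed view over the narrowed schema
}
```

The case class field remains a plain BigDecimal either way; only the schema written out changes, which is what matters for space usage and Hive compatibility.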
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673032#comment-15673032 ] Takeshi Yamamuro commented on SPARK-18478: -- ya, I'll make a pr later, thanks! > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673027#comment-15673027 ] Reynold Xin commented on SPARK-18478: - Yeah, it seems like it's worth doing. > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673017#comment-15673017 ] Takeshi Yamamuro commented on SPARK-18478: -- [~rxin] seems we have some performance gains; https://github.com/apache/spark/compare/master...maropu:SupportHiveUdfCodegen#diff-f781931d35e340e5550357fbb28d14e8R37 {code}
/*
Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13 on Mac OS X 10.10.2
Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz

Call Hive UDF:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------
Call Hive UDF wholestage off         3 / 4           40941.7           0.0       1.0X
Call Hive UDF wholestage on          1 / 2           96620.3           0.0       2.4X
*/
{code} > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672948#comment-15672948 ] Apache Spark commented on SPARK-14974: -- User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/15914 > spark sql job create too many files in HDFS when doing insert overwrite hive > table > -- > > Key: SPARK-14974 > URL: https://issues.apache.org/jira/browse/SPARK-14974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: zenglinxi >Priority: Minor > > Recently, we often encounter problems using spark sql for inserting data into > a partition table (ex.: insert overwrite table $output_table partition(dt) > select xxx from tmp_table). > After the spark job starts running on yarn, the app will create too many files > (ex. 2,000,000, or even 10,000,000), which will put HDFS under enormous > pressure. > We found that the num of files created by the spark job depends on the > partition num of the hive table being inserted into and the num of spark sql > partitions. > files_num = hive_table_partitions_num * spark_sql_partitions_num. > We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= > 1000, and the hive_table_partitions_num is very small under normal > circumstances, but it will turn out to be more than 2000 when we input a > wrong field as the partition field by accident, which will make the files_num > >= 1000 * 2000 = 2,000,000. > There is a configuration parameter in hive that can limit the maximum number > of dynamic partitions allowed to be created in each mapper/reducer, named > hive.exec.max.dynamic.partitions.pernode, but this conf parameter didn't work > when we use hiveContext. > Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the > files_num smaller, but it will affect the concurrency.
> Can we create configuration parameters to limit the maximum number of files > allowed to be created by each task, or limit the spark_sql_partitions_num > without affecting the concurrency? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
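The file-count arithmetic described in the report can be sketched directly. The repartition-by-partition-column mitigation in the comment at the end of the block is common practice, not the configuration limit the ticket asks for:

```scala
// Worst case from the report: each of the spark.sql.shuffle.partitions tasks
// may open one writer per dynamic Hive partition it sees.
def worstCaseFiles(hiveTablePartitions: Long, sqlShufflePartitions: Long): Long =
  hiveTablePartitions * sqlShufflePartitions

worstCaseFiles(2000L, 1000L)  // 2,000,000 files, matching the report

// Common mitigation: cluster rows by the dynamic-partition column before the
// insert, so each task writes to only a few Hive partitions, e.g.
//   df.repartition($"dt").write.insertInto(outputTable)
```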
[jira] [Commented] (SPARK-18074) UDFs don't work on non-local environment
[ https://issues.apache.org/jira/browse/SPARK-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672950#comment-15672950 ] roncenzhao commented on SPARK-18074: I have encountered this problem, too. If any one can give some method to solve this problem? > UDFs don't work on non-local environment > > > Key: SPARK-18074 > URL: https://issues.apache.org/jira/browse/SPARK-18074 > Project: Spark > Issue Type: Bug >Reporter: Alberto Andreotti > > It seems that UDFs fail to deserialize when they are sent to the remote > cluster. This is an app that can help reproduce the problem, > https://github.com/albertoandreottiATgmail/spark_udf > and this is the stack trace with the exception, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at
[jira] [Created] (SPARK-18483) spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032
inred created SPARK-18483: - Summary: spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032 Key: SPARK-18483 URL: https://issues.apache.org/jira/browse/SPARK-18483 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Environment: java8 SBT0.13 scala2.11.8 spark-2.0.1-bin-hadoop2.6 Reporter: inred I have installed the yarn resource manager at 192.168.13.159:8032 and have set YARN_CONF_DIR environment var and have the yarn-site.xml configured as the following, but it always connects to 0.0.0.0:8032 instead of 192.168.13.159:8032 set environment E:\app>set yarn YARN_CONF_DIR=D:\Documents\download\hadoop E:\app>set had HADOOP_HOME=D:\Documents\download\hadoop E:\app>cat D:\Documents\download\hadoop\yarn-site.xml yarn.resourcemanager.address 192.168.13.159:8032 "C:\Program Files\Java\jdk1.8.0_92\bin\java" -Didea.launcher.port=7532 "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 2016.2.5\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_92\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\jce.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_92\jre\lib\resources.jar;C:\Program
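The yarn-site.xml snippet in the report lost its XML tags in transit; the property it shows presumably looked like the fragment below. If the client still logs a connection to 0.0.0.0:8032 with this file in place, a common cause is that the directory named by YARN_CONF_DIR/HADOOP_CONF_DIR is not actually on the application's classpath, so the file is never read:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.13.159:8032</value>
  </property>
</configuration>
```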
[jira] [Commented] (SPARK-17662) Dedup UDAF
[ https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672928#comment-15672928 ] Ohad Raviv commented on SPARK-17662: you're right, great solution! I didn't know about the "max(struct(date, *))" syntax. thanks! > Dedup UDAF > -- > > Key: SPARK-17662 > URL: https://issues.apache.org/jira/browse/SPARK-17662 > Project: Spark > Issue Type: New Feature >Reporter: Ohad Raviv > > We have a common use case of deduping a table in creation order. > For example, we have an event log of user actions. A user marks his favorite > category from time to time. > In our analytics we would like to know only the user's last favorite category. > The data: > user_id  action_type  value  date > 123 fav category 1 2016-02-01 > 123 fav category 4 2016-02-02 > 123 fav category 8 2016-02-03 > 123 fav category 2 2016-02-04 > we would like to get only the last update by the date column. > we could of course do it in sql: > select * from ( > select *, row_number() over (partition by user_id,action_type order by date > desc) as rnum from tbl) > where rnum=1; > but then, I believe it can't be optimized on the mapper side and we'll get > all the data shuffled to the reducers instead of partially aggregated on the > map side. > We have written a UDAF for this, but then we have other issues - like > blocking predicate push-down for columns. > do you have any idea for a proper solution? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
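The max(struct(date, *)) trick acknowledged in the comment can be sketched as follows, assuming an `events` DataFrame with the columns from the example. Because structs compare field by field from the left, taking the max of a struct whose first field is `date` keeps the whole latest row per group, and as an ordinary aggregate it is partially combined on the map side (unlike the row_number() approach):

```scala
import org.apache.spark.sql.functions.{col, max, struct}

// Assumes `events` has columns: user_id, action_type, value, date.
val latest = events
  .groupBy("user_id", "action_type")
  .agg(max(struct(col("date"), col("value"))).as("latest"))
  .select(
    col("user_id"),
    col("action_type"),
    col("latest.date").as("date"),
    col("latest.value").as("value"))
```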
[jira] [Commented] (SPARK-18470) Provide Spark Streaming Monitor Rest Api
[ https://issues.apache.org/jira/browse/SPARK-18470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672745#comment-15672745 ] Genmao Yu commented on SPARK-18470: --- Thanks for your suggestions, I will provide it later. > Provide Spark Streaming Monitor Rest Api > > > Key: SPARK-18470 > URL: https://issues.apache.org/jira/browse/SPARK-18470 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 1.6.3, 2.0.2 >Reporter: Genmao Yu > > provide spark streaming monitor api: > 1. list receiver information > 2. list stream batch information -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672708#comment-15672708 ] Apache Spark commented on SPARK-18481: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/15913 > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > Remove deprecated methods for ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18482) make sure Spark can access the table metadata created by older version of spark
Wenchen Fan created SPARK-18482: --- Summary: make sure Spark can access the table metadata created by older version of spark Key: SPARK-18482 URL: https://issues.apache.org/jira/browse/SPARK-18482 Project: Spark Issue Type: Test Components: SQL Reporter: Wenchen Fan In Spark 2.1, the way of storing table metadata in the Hive metastore has changed a lot, e.g. no runtime schema inference, storing partition columns, storing the schema for hive tables, etc. We should add compatibility tests for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1267) Add a pip installer for PySpark
[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1267: -- Fix Version/s: 2.1.0 > Add a pip installer for PySpark > --- > > Key: SPARK-1267 > URL: https://issues.apache.org/jira/browse/SPARK-1267 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Prabin Banka >Assignee: holdenk >Priority: Minor > Labels: pyspark > Fix For: 2.1.0, 2.2.0 > > > Please refer to this mail archive, > http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18129) Sign pip artifacts
[ https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18129: --- Fix Version/s: 2.1.0 > Sign pip artifacts > -- > > Key: SPARK-18129 > URL: https://issues.apache.org/jira/browse/SPARK-18129 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk >Assignee: holdenk > Fix For: 2.1.0, 2.2.0 > > > To allow us to distribute the Python pip installable artifacts we need to be > able to sign them with make-release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
Yanbo Liang created SPARK-18481: --- Summary: ML 2.1 QA: Remove deprecated methods for ML Key: SPARK-18481 URL: https://issues.apache.org/jira/browse/SPARK-18481 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Assignee: Yanbo Liang Priority: Minor Remove deprecated methods for ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18442) Fix nullability of WrapOption.
[ https://issues.apache.org/jira/browse/SPARK-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18442. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15887 [https://github.com/apache/spark/pull/15887] > Fix nullability of WrapOption. > -- > > Key: SPARK-18442 > URL: https://issues.apache.org/jira/browse/SPARK-18442 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 2.1.0 > > > The nullability of {{WrapOption}} should be {{false}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18442) Fix nullability of WrapOption.
[ https://issues.apache.org/jira/browse/SPARK-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-18442: Assignee: Takuya Ueshin > Fix nullability of WrapOption. > -- > > Key: SPARK-18442 > URL: https://issues.apache.org/jira/browse/SPARK-18442 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 2.1.0 > > > The nullability of {{WrapOption}} should be {{false}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18480) Link validation for ML guides
[ https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18480: Assignee: (was: Apache Spark) > Link validation for ML guides > - > > Key: SPARK-18480 > URL: https://issues.apache.org/jira/browse/SPARK-18480 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: zhengruifeng >Priority: Trivial > > I did a skim of the ML link validation and found the following link issues: > 1, There are two {{[Graph.partitionBy]}} in {{graphx-programming-guide.md}}, > the first one had no effect. > 2, {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in > {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake. > 3, {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, > because class {{PythonMLLibAPI}} is private. > 4, Other link updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18480) Link validation for ML guides
[ https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18480: Assignee: Apache Spark > Link validation for ML guides > - > > Key: SPARK-18480 > URL: https://issues.apache.org/jira/browse/SPARK-18480 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > I did a skim of the ML link validation and found the following link issues: > 1, There are two {{[Graph.partitionBy]}} in {{graphx-programming-guide.md}}, > the first one had no effect. > 2, {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in > {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake. > 3, {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, > because class {{PythonMLLibAPI}} is private. > 4, Other link updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18480) Link validation for ML guides
[ https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672502#comment-15672502 ] Apache Spark commented on SPARK-18480: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15912 > Link validation for ML guides > - > > Key: SPARK-18480 > URL: https://issues.apache.org/jira/browse/SPARK-18480 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: zhengruifeng >Priority: Trivial > > I did a skim of the ML link validation and found the following link issues: > 1, There are two {{[Graph.partitionBy]}} in {{graphx-programming-guide.md}}, > the first one had no effect. > 2, {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in > {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake. > 3, {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, > because class {{PythonMLLibAPI}} is private. > 4, Other link updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18480) Link validation for ML guides
zhengruifeng created SPARK-18480: Summary: Link validation for ML guides Key: SPARK-18480 URL: https://issues.apache.org/jira/browse/SPARK-18480 Project: Spark Issue Type: Bug Components: Documentation Reporter: zhengruifeng Priority: Trivial I did a skim of the ML link validation and found the following link issues: 1, There are two {{[Graph.partitionBy]}} in {{graphx-programming-guide.md}}, the first one had no effect. 2, {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake. 3, {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, because class {{PythonMLLibAPI}} is private. 4, Other link updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672493#comment-15672493 ] Takeshi Yamamuro commented on SPARK-18478: -- okay, I'll check > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18319: Assignee: yuhao yang > ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-18319 > URL: https://issues.apache.org/jira/browse/SPARK-18319 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18320: Assignee: Seth Hendrickson > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672453#comment-15672453 ] Reynold Xin commented on SPARK-18478: - Are there any performance improvements we would get? I'm not sure how much gain we can get, given the conversions involved. > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs.
[jira] [Assigned] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18317: - Assignee: Xiangrui Meng > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18449) Name option is being ignored when submitting an R application via spark-submit
[ https://issues.apache.org/jira/browse/SPARK-18449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-18449: Assignee: Felix Cheung > Name option is being ignored when submitting an R application via spark-submit > -- > > Key: SPARK-18449 > URL: https://issues.apache.org/jira/browse/SPARK-18449 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SparkR >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Alexander Eckert >Assignee: Felix Cheung >Priority: Minor > > The value of the _--name_ parameter is being ignored, when submitting an R > script via _spark-submit_ to a Standalone Spark Cluster. > This is the case when the R script starts a _sparkR.session()_ as well as > when no sparkR.session() is used. I would expect that the value of the --name > parameter should be used when no appName was specified in the > sparkR.session() function or the session() function has not been called at > all. And when the appName was specified, I would expect the same behaviour > like in Scala or Python applications. Currently {{SparkR}} is displayed when > no appName was specified. > Example: > 1. Edit _examples/src/main/r/dataframe.R_ > 2. Replace _sparkR.session(appName = "SparkR-DataFrame-example")_ with > _sparkR.session()_ > 3. Submit dataframe.R > {code:none} > bin/spark-submit --master spark://ubuntu:7077 --name MyApp > examples/src/main/r/dataframe.R > {code} > 4. Compare application names with names in the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18449) Name option is being ignored when submitting an R application via spark-submit
[ https://issues.apache.org/jira/browse/SPARK-18449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672403#comment-15672403 ] Felix Cheung commented on SPARK-18449: -- Good catch, this is likely because the R function has a default value of appName = "SparkR" which is then passed to JVM, overriding the command line value. It gets a bit tricky on the order on which get applied first though, I'll need to look into this to see what is the right fix. > Name option is being ignored when submitting an R application via spark-submit > -- > > Key: SPARK-18449 > URL: https://issues.apache.org/jira/browse/SPARK-18449 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SparkR >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Alexander Eckert >Priority: Minor > > The value of the _--name_ parameter is being ignored, when submitting an R > script via _spark-submit_ to a Standalone Spark Cluster. > This is the case when the R script starts a _sparkR.session()_ as well as > when no sparkR.session() is used. I would expect that the value of the --name > parameter should be used when no appName was specified in the > sparkR.session() function or the session() function has not been called at > all. And when the appName was specified, I would expect the same behaviour > like in Scala or Python applications. Currently {{SparkR}} is displayed when > no appName was specified. > Example: > 1. Edit _examples/src/main/r/dataframe.R_ > 2. Replace _sparkR.session(appName = "SparkR-DataFrame-example")_ with > _sparkR.session()_ > 3. Submit dataframe.R > {code:none} > bin/spark-submit --master spark://ubuntu:7077 --name MyApp > examples/src/main/r/dataframe.R > {code} > 4. Compare application names with names in the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
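The failure mode Felix describes above, a default parameter value unconditionally overwriting a setting that spark-submit already placed in the config, can be sketched outside SparkR. This is a minimal Python illustration of the bug pattern and one possible fix; the function and key names are hypothetical and are not SparkR's actual internals.

```python
_UNSET = object()  # sentinel distinguishing "not passed" from any real value


def session(conf, app_name="SparkR"):
    """Buggy pattern: the default value always clobbers the --name setting."""
    conf["spark.app.name"] = app_name
    return conf


def session_fixed(conf, app_name=_UNSET):
    """Only override when the caller explicitly supplied an appName."""
    if app_name is not _UNSET:
        conf["spark.app.name"] = app_name
    return conf


# conf as it might arrive from `spark-submit --name MyApp`:
print(session({"spark.app.name": "MyApp"})["spark.app.name"])        # SparkR
print(session_fixed({"spark.app.name": "MyApp"})["spark.app.name"])  # MyApp
```

The sentinel makes "no appName given" observable, so the command-line value survives; passing `app_name` explicitly still wins, matching the behavior of Scala and Python applications.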
[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s
[ https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672312#comment-15672312 ] Jason Pan commented on SPARK-18353: --- Thanks, Sean. It works. Just for the doc: "spark.network.timeout, or 10s in standalone clusters" Actually, the default is not 10s in standalone mode when using REST. > spark.rpc.askTimeout default value is not 120s > -- > > Key: SPARK-18353 > URL: https://issues.apache.org/jira/browse/SPARK-18353 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 > Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 > 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jason Pan >Priority: Critical > > in http://spark.apache.org/docs/latest/configuration.html > spark.rpc.askTimeout 120sDuration for an RPC ask operation to wait > before timing out > the default value is 120s as documented. > However, when I run "spark-submit" in standalone cluster mode: > the cmd is: > Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" > "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" > "-Xmx1024M" "-Dspark.eventLog.enabled=true" > "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" > "-Dspark.app.name=org.apache.spark.examples.SparkPi" > "-Dspark.submit.deployMode=cluster" > "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar" > "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" > "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" > "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" > "org.apache.spark.deploy.worker.DriverWrapper" > "spark://Worker@9.111.159.127:7103" > "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar" > "org.apache.spark.examples.SparkPi" "1000" > Dspark.rpc.askTimeout=10 > the value is 10, which does not match the documentation. > Note: when I submit to the REST URL, this issue does not occur. 
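The resolution order behind this discussion, where spark.rpc.askTimeout falls back to spark.network.timeout and then to the 120s default, explains why an explicit -Dspark.rpc.askTimeout=10 wins over the documented default. A minimal Python sketch of that lookup chain, purely illustrative and not Spark's actual code:

```python
def ask_timeout(conf: dict) -> str:
    # Fallback chain as documented: the RPC-specific key wins,
    # then the generic network timeout, then the 120s default.
    return conf.get("spark.rpc.askTimeout",
                    conf.get("spark.network.timeout", "120s"))


print(ask_timeout({}))                                # 120s
print(ask_timeout({"spark.network.timeout": "60s"}))  # 60s
print(ask_timeout({"spark.rpc.askTimeout": "10"}))    # 10
```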
[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
[ https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18468: - Description: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ https://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.4/71 Seems we failed to stop the driver {code} 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 2016-11-15 18:36:47.76 - stderr>at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.util.Try$.apply(Try.scala:192) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure.recover(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) 2016-11-15 
18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Promise$class.complete(Promise.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 
18:36:47.76 - stderr>at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239) 2016-11-15 18:36:47.76 - stderr>at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 2016-11-15 18:36:47.76 - stderr>at
[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
[ https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18468: - Component/s: (was: SQL) Spark Core > Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist > Parquet relation with decimal column > -- > > Key: SPARK-18468 > URL: https://issues.apache.org/jira/browse/SPARK-18468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yin Huai >Priority: Critical > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ > Seems we failed to stop the driver > {code} > 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: > Cannot receive any reply in 120 seconds. This timeout is controlled by > spark.rpc.askTimeout > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > 2016-11-15 18:36:47.76 - stderr> at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at scala.util.Try$.apply(Try.scala:192) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure.recover(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.complete(Promise.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) > 2016-11-15 18:36:47.76 - stderr> at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.tryFailure(Promise.scala:112) > 2016-11-15 18:36:47.76 -
[jira] [Created] (SPARK-18479) spark.sql.shuffle.partitions defaults should be a prime number
Hamel Ajay Kothari created SPARK-18479: -- Summary: spark.sql.shuffle.partitions defaults should be a prime number Key: SPARK-18479 URL: https://issues.apache.org/jira/browse/SPARK-18479 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Hamel Ajay Kothari For most hash bucketing use cases it is my understanding that a prime value, such as 199, would be a safer value than the existing value of 200. Using a non-prime value makes the likelihood of collisions much higher when the hash function isn't great. Consider the case where you've got a Timestamp or Long column with millisecond times at midnight each day. With the default value for spark.sql.shuffle.partitions, you'll end up with 120/200 partitions being completely empty. Looking around there doesn't seem to be a good reason why we chose 200 so I don't see a huge risk in changing it.
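The divisibility problem described in this report can be shown with a short Python sketch. Note this is a simplification: Spark SQL actually runs values through a Murmur3 hash before taking the modulus, so real skew is less extreme than this; the sketch only demonstrates why a composite partition count interacts badly with a weak (identity-like) hash on regularly spaced keys such as daily midnight timestamps.

```python
DAY_MS = 86_400_000  # milliseconds per day


def partitions_used(num_partitions: int, days: int = 365) -> int:
    """Count distinct partitions hit by one midnight timestamp per day,
    assuming the partition is just the raw value modulo the count."""
    return len({(k * DAY_MS) % num_partitions for k in range(days)})


# 86,400,000 is divisible by 200, so every midnight timestamp collapses
# into a single bucket; 199 is prime and coprime to the step, so the
# same keys spread across all 199 buckets.
print(partitions_used(200))  # 1
print(partitions_used(199))  # 199
```

With a better hash in front, as Spark actually uses, the gap narrows, which is part of why the trade-off is debatable; the sketch shows only the worst case the reporter is guarding against.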
[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672272#comment-15672272 ] Takeshi Yamamuro commented on SPARK-18478: -- We can fix this simply (https://github.com/apache/spark/compare/master...maropu:SupportHiveUdfCodegen), though I'm not sure the fix makes sense because we plan to refactor the Hive integration in SPARK-15691. cc: [~rxin] > Support codegen for Hive UDFs > - > > Key: SPARK-18478 > URL: https://issues.apache.org/jira/browse/SPARK-18478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently does not codegen Hive UDFs in hiveUDFs.
[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672245#comment-15672245 ] peay edited comment on SPARK-18473 at 11/17/16 12:55 AM: - Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? (edit: nevermind, the fix appears to be in 2.0.2 indeed). was (Author: peay): Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay >Assignee: Xiao Li > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. 
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. > - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18478) Support codegen for Hive UDFs
Takeshi Yamamuro created SPARK-18478: Summary: Support codegen for Hive UDFs Key: SPARK-18478 URL: https://issues.apache.org/jira/browse/SPARK-18478 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.2 Reporter: Takeshi Yamamuro Spark currently does not codegen Hive UDFs in hiveUDFs.
[jira] [Commented] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672225#comment-15672225 ] Apache Spark commented on SPARK-18477: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15911 > Enable interrupts for HDFS in HDFSMetadataLog > - > > Key: SPARK-18477 > URL: https://issues.apache.org/jira/browse/SPARK-18477 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > HDFS `write` may just hang until timeout if some network error happens. It's > better to enable interrupts to allow stopping the query fast.
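The problem this issue addresses, a blocking write that cannot be cancelled keeping the whole query alive, can be illustrated outside Spark. A hedged Python sketch of the general pattern: run the possibly-hanging write on a worker so the caller can give up after a deadline instead of blocking indefinitely. (Python cannot forcibly interrupt the worker; the Java thread interrupts this JIRA enables go further and actually abort the blocked HDFS call.)

```python
import threading
import time


def stoppable_write(write_fn, timeout_s: float) -> bool:
    """Run a possibly-hanging write; return False if it did not finish
    within timeout_s seconds, letting the caller stop waiting."""
    done = threading.Event()

    def worker():
        write_fn()
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout=timeout_s)


# A write that hangs, as a write over a broken network connection might:
hung = stoppable_write(lambda: time.sleep(60), timeout_s=0.1)
print(hung)  # False: the caller can now fail the query quickly
```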
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672245#comment-15672245 ] peay commented on SPARK-18473: -- Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay >Assignee: Xiao Li > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. > - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. 
> - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18477: Assignee: Shixiong Zhu (was: Apache Spark) > Enable interrupts for HDFS in HDFSMetadataLog > - > > Key: SPARK-18477 > URL: https://issues.apache.org/jira/browse/SPARK-18477 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > HDFS `write` may just hang until timeout if some network error happens. It's > better to enable interrupts to allow stopping the query fast.
[jira] [Assigned] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18477: Assignee: Apache Spark (was: Shixiong Zhu) > Enable interrupts for HDFS in HDFSMetadataLog > - > > Key: SPARK-18477 > URL: https://issues.apache.org/jira/browse/SPARK-18477 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > HDFS `write` may just hang until timeout if some network error happens. It's > better to enable interrupts to allow stopping the query fast.
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672231#comment-15672231 ] Dongjoon Hyun commented on SPARK-18473: --- Hi, [~peay]. Maybe, the relevant one is https://issues.apache.org/jira/browse/SPARK-17806 . I think we can close this issue since it's already proven by the example. > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay >Assignee: Xiao Li > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. 
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. > - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672235#comment-15672235 ]

Dongjoon Hyun commented on SPARK-18473:
---------------------------------------

Wow. Please forget about my comment. :) I didn't read [~hvanhovell]'s comments. Sorry.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell closed SPARK-18473.
-------------------------------------
    Resolution: Fixed
      Assignee: Xiao Li

Fixed by gatorsmile's PR for SPARK-17981/SPARK-17957

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672223#comment-15672223 ]

Herman van Hovell commented on SPARK-18473:
-------------------------------------------

This is probably caused by SPARK-17981. That bug creates incorrect nullability flags, which messes with the result of coalesce (fillna) in your query.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog
Shixiong Zhu created SPARK-18477:
--------------------------------

             Summary: Enable interrupts for HDFS in HDFSMetadataLog
                 Key: SPARK-18477
                 URL: https://issues.apache.org/jira/browse/SPARK-18477
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
            Reporter: Shixiong Zhu
            Assignee: Shixiong Zhu

An HDFS `write` may simply hang until it times out if a network error happens. It's better to enable interrupts so that a query can be stopped quickly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
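The actual fix would interrupt the writer thread on the JVM side. As a rough, hypothetical analogy (not the HDFSMetadataLog code, and a timeout-bounded write rather than true thread interruption), the pattern of not letting a hung write block the stop path can be sketched in Python:

```python
import threading

# Illustrative only: run a possibly-hanging write on a worker thread and
# bound the wait, so the caller (e.g. a query stop path) is never blocked
# indefinitely. Names here are made up, not Spark's API.
def write_with_timeout(write_fn, timeout_s):
    result = {}

    def worker():
        try:
            result["value"] = write_fn()
        except Exception as exc:  # surface worker errors to the caller
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        raise TimeoutError("write did not complete within %ss" % timeout_s)
    if "error" in result:
        raise result["error"]
    return result["value"]

print(write_with_timeout(lambda: "committed", timeout_s=1.0))  # prints: committed
```

Unlike interruption, a timeout leaves the hung worker running in the background; interrupting the HDFS write thread, as this issue proposes, actually cancels the stuck I/O.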
[jira] [Commented] (SPARK-18476) SparkR Logistic Regression should support output original label.
[ https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672218#comment-15672218 ]

Apache Spark commented on SPARK-18476:
--------------------------------------

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/15910

> SparkR Logistic Regression should support output original label.
> -----------------------------------------------------------------
>
>                 Key: SPARK-18476
>                 URL: https://issues.apache.org/jira/browse/SPARK-18476
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Miao Wang
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression
> should support output of the original label instead of the index label.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18476) SparkR Logistic Regression should support output original label.
[ https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18476:
------------------------------------

    Assignee: Apache Spark

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18476) SparkR Logistic Regression should support output original label.
[ https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18476:
------------------------------------

    Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18476) SparkR Logistic Regression should support output original label.
Miao Wang created SPARK-18476:
------------------------------

             Summary: SparkR Logistic Regression should support output original label.
                 Key: SPARK-18476
                 URL: https://issues.apache.org/jira/browse/SPARK-18476
             Project: Spark
          Issue Type: Bug
          Components: SparkR
            Reporter: Miao Wang

Similar to [SPARK-18401], as a classification algorithm, logistic regression should support output of the original label instead of the index label.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671923#comment-15671923 ] Apache Spark commented on SPARK-18475: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15909 > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18475:
------------------------------------

    Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18475:
------------------------------------

    Assignee: Apache Spark

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
Burak Yavuz created SPARK-18475: --- Summary: Be able to provide higher parallelization for StructuredStreaming Kafka Source Key: SPARK-18475 URL: https://issues.apache.org/jira/browse/SPARK-18475 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.0.2, 2.1.0 Reporter: Burak Yavuz Right now the StructuredStreaming Kafka Source creates as many Spark tasks as there are TopicPartitions that we're going to read from Kafka. This doesn't work well when we have data skew, and there is no reason why we shouldn't be able to increase parallelism further, i.e. have multiple Spark tasks reading from the same Kafka TopicPartition. What this will mean is that we won't be able to use the "CachedKafkaConsumer" for what it is defined for (being cached) in this use case, but the extra overhead is worth handling data skew and increasing parallelism especially in ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
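The core of the proposal is letting more than one task read a single TopicPartition. One natural way to do that is to split the partition's offset range into contiguous sub-ranges, one per task. A hypothetical sketch (the function name and splitting policy are illustrative, not Spark's implementation):

```python
# Hypothetical sketch: split one TopicPartition's offset range [start, end)
# into at most `parts` contiguous, non-empty sub-ranges so several Spark
# tasks can consume the same partition in parallel.
def split_offset_range(start, end, parts):
    total = end - start
    if total <= 0 or parts <= 1:
        return [(start, end)]
    parts = min(parts, total)          # never create empty sub-ranges
    base, rem = divmod(total, parts)   # distribute the remainder evenly
    ranges, cursor = [], start
    for i in range(parts):
        size = base + (1 if i < rem else 0)
        ranges.append((cursor, cursor + size))
        cursor += size
    return ranges

print(split_offset_range(100, 110, 4))
# [(100, 103), (103, 106), (106, 108), (108, 110)]
```

As the issue notes, each sub-range reader would need its own consumer seeked to the sub-range start, which is why the cached-consumer reuse is sacrificed in this mode.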
[jira] [Resolved] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18186. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15703 [https://github.com/apache/spark/pull/15703] > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
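The contract that makes partial aggregation possible boils down to four operations: create a mutable buffer, update it per input row, merge buffers from different partitions, and evaluate the final result. A plain-Python sketch of that contract (illustrative; Spark's actual `TypedImperativeAggregate` is a Scala class and additionally requires buffer serialization for the shuffle):

```python
# Plain-Python sketch of the partial-aggregation contract: map-side tasks
# build partial buffers locally, and only the buffers cross the shuffle,
# not the raw rows. Shown for an average aggregate.
class AvgAggregate:
    def create_buffer(self):
        return [0.0, 0]            # mutable (sum, count) state

    def update(self, buf, value):  # map side, one input row
        buf[0] += value
        buf[1] += 1
        return buf

    def merge(self, buf1, buf2):   # reduce side, combine partials
        buf1[0] += buf2[0]
        buf1[1] += buf2[1]
        return buf1

    def eval(self, buf):           # produce the final result
        return buf[0] / buf[1] if buf[1] else None

agg = AvgAggregate()
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

partials = []
for part in partitions:            # partial aggregation per partition
    buf = agg.create_buffer()
    for v in part:
        agg.update(buf, v)
    partials.append(buf)

final = agg.create_buffer()
for buf in partials:               # merge partial buffers
    agg.merge(final, buf)
print(agg.eval(final))             # 3.5
```

Migrating `HiveUDAFFunction` onto this contract is what lets Hive UDAF queries use hash aggregation with partial merges instead of falling back to `SortAggregateExec` without partial aggregation.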
[jira] [Resolved] (SPARK-1267) Add a pip installer for PySpark
[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1267. --- Resolution: Fixed Fix Version/s: 2.2.0 Merged into master (2.2) and will consider for 2.1. > Add a pip installer for PySpark > --- > > Key: SPARK-1267 > URL: https://issues.apache.org/jira/browse/SPARK-1267 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Prabin Banka >Assignee: holdenk >Priority: Minor > Labels: pyspark > Fix For: 2.2.0 > > > Please refer to this mail archive, > http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18129) Sign pip artifacts
[ https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18129: --- Assignee: holdenk > Sign pip artifacts > -- > > Key: SPARK-18129 > URL: https://issues.apache.org/jira/browse/SPARK-18129 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk >Assignee: holdenk > Fix For: 2.2.0 > > > To allow us to distribute the Python pip installable artifacts we need to be > able to sign them with make-release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18129) Sign pip artifacts
[ https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-18129. Resolution: Fixed Fix Version/s: 2.2.0 Merged to master (2.2). > Sign pip artifacts > -- > > Key: SPARK-18129 > URL: https://issues.apache.org/jira/browse/SPARK-18129 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk >Assignee: holdenk > Fix For: 2.2.0 > > > To allow us to distribute the Python pip installable artifacts we need to be > able to sign them with make-release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1267) Add a pip installer for PySpark
[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1267: -- Assignee: holdenk > Add a pip installer for PySpark > --- > > Key: SPARK-1267 > URL: https://issues.apache.org/jira/browse/SPARK-1267 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Prabin Banka >Assignee: holdenk >Priority: Minor > Labels: pyspark > Fix For: 2.2.0 > > > Please refer to this mail archive, > http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16609: Assignee: (was: Reynold Xin) > Single function for parsing timestamps/dates > > > Key: SPARK-16609 > URL: https://issues.apache.org/jira/browse/SPARK-16609 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust > > Today, if you want to parse a date or timestamp, you have to use the unix > time function and then cast to a timestamp. Its a little odd there isn't a > single function that does both. I propose we add > {code} > to_date(, )/to_timestamp(, ). > {code} > For reference, in other systems there are: > MS SQL: {{convert(, )}}. See: > https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx > Netezza: {{to_timestamp(, )}}. See: > https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html > Teradata has special casting functionality: {{cast( as timestamp > format '')}} > MySql: {{STR_TO_DATE(, )}}. This returns a datetime when you > define both date and time parts. See: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
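For illustration, the proposed format-directed parsing can be mimicked in plain Python with `strptime`. Note the assumptions: these `to_date`/`to_timestamp` helpers and the `%`-style patterns are Python conventions used only to show the idea; Spark's eventual functions take Java `SimpleDateFormat` patterns:

```python
from datetime import datetime

# Illustrative stand-ins for the proposed single-call parsers.
def to_date(s, fmt):
    """Parse a string directly to a date using an explicit format."""
    return datetime.strptime(s, fmt).date()

def to_timestamp(s, fmt):
    """Parse a string directly to a timestamp using an explicit format."""
    return datetime.strptime(s, fmt)

# A year-day-month string, as in the SPARK-18424 example, parsed in one call
# instead of the unix_timestamp + cast round-trip.
d = to_date("2017-20-12", "%Y-%d-%m")
print(d.isoformat())  # 2017-12-20
```

The point of the proposal is exactly this one-call shape: the format travels with the parse, instead of being threaded through `unix_timestamp` and a cast to timestamp.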
[jira] [Resolved] (SPARK-18424) Single Function for Parsing Dates and Times with Formats
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers resolved SPARK-18424. --- Resolution: Duplicate This is a duplicate of SPARK-16609. Work will continue there. > Single Function for Parsing Dates and Times with Formats > > > Key: SPARK-18424 > URL: https://issues.apache.org/jira/browse/SPARK-18424 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > > I've found it quite cumbersome to work with dates thus far in Spark, it can > be hard to reason about the timeformat and what type you're working with, for > instance: > say that I have a date in the format > {code} > 2017-20-12 > // Y-D-M > {code} > In order to parse that into a Date, I have to perform several conversions. > {code} > to_date( > unix_timestamp(col("date"), dateFormat) > .cast("timestamp")) >.alias("date") > {code} > I propose simplifying this by adding a to_date function (exists) but adding > one that accepts a format for that date. I also propose a to_timestamp > function that also supports a format. > so that you can avoid entirely the above conversion. > It's also worth mentioning that many other databases support this. For > instance, mysql has the STR_TO_DATE function, netezza supports the > to_timestamp semantic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671632#comment-15671632 ] Bill Chambers commented on SPARK-16609: --- I am working on this. > Single function for parsing timestamps/dates > > > Key: SPARK-16609 > URL: https://issues.apache.org/jira/browse/SPARK-16609 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Reynold Xin > > Today, if you want to parse a date or timestamp, you have to use the unix > time function and then cast to a timestamp. Its a little odd there isn't a > single function that does both. I propose we add > {code} > to_date(, )/to_timestamp(, ). > {code} > For reference, in other systems there are: > MS SQL: {{convert(, )}}. See: > https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx > Netezza: {{to_timestamp(, )}}. See: > https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html > Teradata has special casting functionality: {{cast( as timestamp > format '')}} > MySql: {{STR_TO_DATE(, )}}. This returns a datetime when you > define both date and time parts. See: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18424) Single Function for Parsing Dates and Times with Formats
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18424: -- Summary: Single Function for Parsing Dates and Times with Formats (was: Single Funct) > Single Function for Parsing Dates and Times with Formats > > > Key: SPARK-18424 > URL: https://issues.apache.org/jira/browse/SPARK-18424 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > > I've found it quite cumbersome to work with dates thus far in Spark, it can > be hard to reason about the timeformat and what type you're working with, for > instance: > say that I have a date in the format > {code} > 2017-20-12 > // Y-D-M > {code} > In order to parse that into a Date, I have to perform several conversions. > {code} > to_date( > unix_timestamp(col("date"), dateFormat) > .cast("timestamp")) >.alias("date") > {code} > I propose simplifying this by adding a to_date function (exists) but adding > one that accepts a format for that date. I also propose a to_timestamp > function that also supports a format. > so that you can avoid entirely the above conversion. > It's also worth mentioning that many other databases support this. For > instance, mysql has the STR_TO_DATE function, netezza supports the > to_timestamp semantic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18424) Single Funct
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18424: -- Summary: Single Funct (was: Improve Date Parsing Semantics & Functionality) > Single Funct > > > Key: SPARK-18424 > URL: https://issues.apache.org/jira/browse/SPARK-18424 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > > I've found it quite cumbersome to work with dates thus far in Spark, it can > be hard to reason about the timeformat and what type you're working with, for > instance: > say that I have a date in the format > {code} > 2017-20-12 > // Y-D-M > {code} > In order to parse that into a Date, I have to perform several conversions. > {code} > to_date( > unix_timestamp(col("date"), dateFormat) > .cast("timestamp")) >.alias("date") > {code} > I propose simplifying this by adding a to_date function (exists) but adding > one that accepts a format for that date. I also propose a to_timestamp > function that also supports a format. > so that you can avoid entirely the above conversion. > It's also worth mentioning that many other databases support this. For > instance, mysql has the STR_TO_DATE function, netezza supports the > to_timestamp semantic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671624#comment-15671624 ] peay commented on SPARK-18473: -- Ah, great, thanks. I had checked out the CHANGELOG but couldn't find anything relevant, any reference to the corresponding issue? > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. > - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. 
> - I use a LEFT OUTER join instead of INNER JOIN. > - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671609#comment-15671609 ] Herman van Hovell commented on SPARK-18473: --- This has been fixed in spark 2.0.2. > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. > - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. 
> - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671609#comment-15671609 ] Herman van Hovell edited comment on SPARK-18473 at 11/16/16 8:59 PM: - This has been fixed in spark 2.0.2. See the following notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/3885409687420673/6987336228780374/latest.html was (Author: hvanhovell): This has been fixed in spark 2.0.2. > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. 
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. > - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18474) Add StreamingQuery.status in python
Tathagata Das created SPARK-18474: - Summary: Add StreamingQuery.status in python Key: SPARK-18474 URL: https://issues.apache.org/jira/browse/SPARK-18474 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.0.2 Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473: - Description: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how two rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. The result is *correct* if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to `fillna` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read them back and join these. 
This can be reproduced in pyspark using: {code} import pyspark.sql.functions as F from pyspark.sql.functions import col from pyspark.sql.window import Window df1 = sql_context.createDataFrame( pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) ) window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") df2 = ( df1 .withColumn("hasOne", (col("index") == 1).cast("int")) .withColumn("hasFifty", (col("index") == 50).cast("int")) .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) .fillna({ 'numFiftyStrictlyBefore': 0 }) .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore")) ) df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) df_joined = df_selector.join(df2, "sessId", how="inner") df2.show() df_selector.show() df_joined.show() {code} was: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. 
The result is *correct* if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to `fillna` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. -
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473: - Description: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. The result is *correct* if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to `fillna` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read them back and join these. 
This can be reproduced in pyspark using: {code} import pyspark.sql.functions as F from pyspark.sql.functions import col from pyspark.sql.window import Window df1 = sql_context.createDataFrame( pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) ) window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") df2 = ( df1 .withColumn("hasOne", (col("index") == 1).cast("int")) .withColumn("hasFifty", (col("index") == 50).cast("int")) .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) .fillna({ 'numFiftyStrictlyBefore': 0 }) .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore")) ) df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) df_joined = df_selector.join(df2, "sessId", how="inner") df2.show() df_selector.show() df_joined.show() {code} was: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. 
The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to `fillna` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. -
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473: - Description: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to `fillna` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read them back and join these. 
This can be reproduced in pyspark using: {code} import pyspark.sql.functions as F from pyspark.sql.functions import col from pyspark.sql.window import Window df1 = sql_context.createDataFrame( pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) ) window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") df2 = ( df1 .withColumn("hasOne", (col("index") == 1).cast("int")) .withColumn("hasFifty", (col("index") == 50).cast("int")) .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) .fillna({ 'numFiftyStrictlyBefore': 0 }) .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore")) ) df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) df_joined = df_selector.join(df2, "sessId", how="inner") df2.show() df_selector.show() df_joined.show() {code} was: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. 
The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473: - Description: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read them back and join these. 
This can be reproduced in pyspark using: {code} import pyspark.sql.functions as F from pyspark.sql.functions import col from pyspark.sql.window import Window df1 = sql_context.createDataFrame( pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) ) window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") df2 = ( df1 .withColumn("hasOne", (col("index") == 1).cast("int")) .withColumn("hasFifty", (col("index") == 50).cast("int")) .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) .fillna({ 'numFiftyStrictlyBefore': 0 }) .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore")) ) df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) df_joined = df_selector.join(df2, "sessId", how="inner") df2.show() df_selector.show() df_joined.show() {code} was: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. 
The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473: - Description: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet, read them back and join these. 
This can be reproduced in pyspark using: {code:python} import pyspark.sql.functions as F from pyspark.sql.functions import col from pyspark.sql.window import Window df1 = sql_context.createDataFrame( pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) ) window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") df2 = ( df1 .withColumn("hasOne", (col("index") == 1).cast("int")) .withColumn("hasFifty", (col("index") == 50).cast("int")) .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) .fillna({ 'numFiftyStrictlyBefore': 0 }) .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore")) ) df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) df_joined = df_selector.join(df2, "sessId", how="inner") df2.show() df_selector.show() df_joined.show() {code} was: I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining {code} +-+-+--+++--+--+ |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| +-+-+--+++--+--+ |1|1| 1| 0| 1| 0| 1| |2|2| 0| 0| 1| 0| 1| |1|3| 1| 0| 2| 0| 2| +-+-+--+++--+--+ {code} with {code} +--+ |sessId| +--+ | 1| | 2| +--+ {code} The result is {code} +--+-+-+--+++--+ |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| +--+-+-+--+++--+ | 1|2|2| 0| 0| 1| 0| | 2|1|1| 1| 0| 1|-1| | 2|1|3| 1| 0| 2| 0| +--+-+-+--+++--+ {code} Note how rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while always zero in the original dataframe. The first dataframe uses two windows: - `hasOne` uses a `window.rowsBetween(-10, 0)`. - `hasFifty` uses a `window.rowsBetween(-10, -1)`. 
The result is **correct** if: - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`. - I add `.fillna({ 'numOnesBefore': 0 })`, although there are no visible effect on the dataframe as shown by `show` as far as I can tell. - I use a LEFT OUTER join instead of INNER JOIN. - I write both dataframes to Parquet,
[jira] [Created] (SPARK-18473) Correctness issue in INNER join result with window functions
peay created SPARK-18473:
Summary: Correctness issue in INNER join result with window functions
Key: SPARK-18473
URL: https://issues.apache.org/jira/browse/SPARK-18473
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: peay

I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong. This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows have a sessId of 2 (instead of one row as expected), and how `fiftyCount` can now be negative while it is always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })`, although there is no visible effect on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of an INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.
This can be reproduced in pyspark using:

{code:python}
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
    pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
    df1
    .withColumn("hasOne", (col("index") == 1).cast("int"))
    .withColumn("hasFifty", (col("index") == 50).cast("int"))
    .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
    .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
    .fillna({'numFiftyStrictlyBefore': 0})
    .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
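For reference, the window frames above can be evaluated by hand to show what the INNER join *should* return: every sessId in df2 appears in the selector, so the join should be the identity. A dependency-free sketch in plain Python (the `rolling_sum` helper mimics the `rowsBetween` frame semantics; this is illustrative, not Spark code):

```python
def rolling_sum(values, start, end):
    """Sum of values[i+start .. i+end] for each i; None for an empty frame,
    mimicking F.sum(...).over(window.rowsBetween(start, end))."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i + start), min(len(values) - 1, i + end)
        frame = values[lo:hi + 1]
        out.append(sum(frame) if frame else None)
    return out

# the three input rows as (index, timeStamp), ordered by (timeStamp, index)
rows = sorted([(1, 1), (2, 2), (1, 3)], key=lambda r: (r[1], r[0]))
has_one = [1 if idx == 1 else 0 for idx, _ in rows]
has_fifty = [1 if idx == 50 else 0 for idx, _ in rows]

num_ones_before = rolling_sum(has_one, -10, 0)            # rowsBetween(-10, 0)
num_fifty_strictly_before = [v if v is not None else 0    # fillna(0)
                             for v in rolling_sum(has_fifty, -10, -1)]
sess_id = [a - b for a, b in zip(num_ones_before, num_fifty_strictly_before)]

selector = {1, 2}
joined = [r for r, s in zip(rows, sess_id) if s in selector]
print(sess_id, len(joined))  # [1, 1, 2] 3 -- the inner join keeps all three rows
```

So the expected sessId column is [1, 1, 2] and the join should return all three input rows, not the shuffled result shown in the issue.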
[jira] [Closed] (SPARK-16795) Spark's HiveThriftServer should be able to use multiple sqlContexts
[ https://issues.apache.org/jira/browse/SPARK-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-16795. - Resolution: Duplicate

> Spark's HiveThriftServer should be able to use multiple sqlContexts
> ---
>
> Key: SPARK-16795
> URL: https://issues.apache.org/jira/browse/SPARK-16795
> Project: Spark
> Issue Type: Wish
> Affects Versions: 2.0.0
> Reporter: Furcy Pin
>
> It seems that when sending multiple Hive queries to the thrift server, the
> server cannot parallelize the query planning, because it uses only one
> global sqlContext.
> This makes the server very inefficient at handling many small concurrent
> queries.
> It would be nice to have it use a pool of sqlContexts instead, with a
> configurable maximum number of contexts.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16795) Spark's HiveThriftServer should be able to use multiple sqlContexts
[ https://issues.apache.org/jira/browse/SPARK-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671462#comment-15671462 ] Herman van Hovell commented on SPARK-16795: --- Spark uses one Hive client per spark context. This is known to cause performance issues in a highly concurrent environment because multiple sessions are contending for a single Hive client. This is a duplicate of SPARK-14003, and I am closing this one as such.

> Spark's HiveThriftServer should be able to use multiple sqlContexts
> ---
>
> Key: SPARK-16795
> URL: https://issues.apache.org/jira/browse/SPARK-16795
> Project: Spark
> Issue Type: Wish
> Affects Versions: 2.0.0
> Reporter: Furcy Pin
>
> It seems that when sending multiple Hive queries to the thrift server, the
> server cannot parallelize the query planning, because it uses only one
> global sqlContext.
> This makes the server very inefficient at handling many small concurrent
> queries.
> It would be nice to have it use a pool of sqlContexts instead, with a
> configurable maximum number of contexts.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
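The wish above — a bounded pool of contexts with a configurable maximum — is a generic object-pool pattern. A minimal sketch in plain Python; the `ContextPool` name and the `factory` callable are illustrative stand-ins, not Spark or Hive API:

```python
import queue

class ContextPool:
    """Hand out at most max_contexts contexts; acquire blocks when all are in use."""

    def __init__(self, factory, max_contexts):
        self._pool = queue.Queue()
        for _ in range(max_contexts):
            self._pool.put(factory())   # pre-create the whole pool

    def acquire(self, timeout=None):
        # blocks until a context is free (or raises queue.Empty on timeout)
        return self._pool.get(timeout=timeout)

    def release(self, ctx):
        self._pool.put(ctx)

# object() stands in for a per-session context
pool = ContextPool(factory=lambda: object(), max_contexts=4)
ctx = pool.acquire()
# ... run a query with ctx ...
pool.release(ctx)
```

Whether real SQLContexts can be pooled this way is exactly the open question in the ticket; the sketch only shows the bounded-concurrency mechanism being requested.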
[jira] [Resolved] (SPARK-16865) A file-based end-to-end SQL query suite
[ https://issues.apache.org/jira/browse/SPARK-16865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-16865. --- Resolution: Fixed Assignee: Peter Lee Fix Version/s: 2.0.1 2.1.0 > A file-based end-to-end SQL query suite > --- > > Key: SPARK-16865 > URL: https://issues.apache.org/jira/browse/SPARK-16865 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.1.0, 2.0.1 > > > Spark currently has a large number of end-to-end SQL test cases in various > SQLQuerySuites. It is fairly difficult to manage and operate these end-to-end > test cases. This ticket proposes that we use files to specify queries and > expected outputs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-16951: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-18455 > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process {{NOT IN}} subquery is to rewrite > to a form of Anti-join with null-aware property in the Logical Plan and then > translate to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} > predicate is limited to the nested-loop join execution plan, which will have > a major performance implication if both sides' results are large. > This JIRA sketches an idea of changing the OR predicate to a form similar to > the technique used in the implementation of the Existence join that addresses > the problem of {{EXISTS (..) OR ..}} type of queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
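The null-aware property mentioned above is what makes the rewrite hard: under SQL three-valued logic, `x NOT IN (subquery)` is not plain set membership, because a NULL on either side makes the predicate UNKNOWN (and the row is filtered out). A dependency-free sketch of the semantics any anti-join rewrite must preserve (plain Python, illustrative):

```python
def sql_not_in(x, values):
    """Three-valued `x NOT IN values`: True, False, or None (UNKNOWN)."""
    if x is None:
        return True if not values else None  # NOT IN an empty set is TRUE
    non_null = [v for v in values if v is not None]
    if x in non_null:
        return False
    if len(non_null) < len(values):          # a NULL is present: x might match it
        return None
    return True

print(sql_not_in(1, [2, 3]))       # True
print(sql_not_in(2, [2, 3]))       # False
print(sql_not_in(2, [3, None]))    # None -> the row is filtered out by WHERE
print(sql_not_in(None, [2, 3]))    # None
```

The NULL cases are why the rewrite introduces an OR-of-equality-or-null predicate, which in turn forces the nested-loop join plan this ticket wants to avoid.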
[jira] [Resolved] (SPARK-17268) Break Optimizer.scala apart
[ https://issues.apache.org/jira/browse/SPARK-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17268. --- Resolution: Fixed Fix Version/s: 2.1.0 > Break Optimizer.scala apart > --- > > Key: SPARK-17268 > URL: https://issues.apache.org/jira/browse/SPARK-17268 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > Optimizer.scala has become too large to maintain. We would need to break it > apart into multiple files each of which contains rules that are logically > relevant. > We can create the following files for logical grouping: > - finish analysis > - joins > - expressions > - subquery > - objects -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671415#comment-15671415 ] Sital Kedia commented on SPARK-13510: - [~shenhong] - We are seeing the same issue on our side. Do you have a PR for this yet?

> Shuffle may throw FetchFailedException: Direct buffer memory
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Hong Shen
> Attachments: spark-13510.diff
>
> In our cluster, when I tested spark-1.6.0 with a sql query, it threw an exception and failed.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
> at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
> at
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries)
> {code}
> The reason is that when shuffling a big block (like 1G), the task will allocate the same amount of memory, so it will easily throw "FetchFailedException: Direct buffer memory".
> If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will throw
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>
> In mapreduce shuffle, it will first judge whether the block can be cached in memory, but spark doesn't.
> If the block is more than we can cache in memory, we should write to disk.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
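The reporter's suggestion — buffer small shuffle blocks in memory, but stream a block above some threshold to disk instead of allocating one block-sized buffer — can be sketched in a dependency-free way. The function name, the 64 MB threshold, and the chunk size below are illustrative assumptions, not Spark code:

```python
import io
import tempfile

def fetch_block(stream, block_size, max_in_memory=64 * 1024 * 1024, chunk_size=1 << 16):
    """Read block_size bytes from stream in small chunks; buffer small blocks
    in memory, but stream blocks above the threshold to a temporary file so
    no single allocation matches the (possibly ~1 GB) block size."""
    sink = io.BytesIO() if block_size <= max_in_memory else tempfile.TemporaryFile()
    remaining = block_size
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break
        sink.write(chunk)
        remaining -= len(chunk)
    sink.seek(0)
    return sink

small = fetch_block(io.BytesIO(b"x" * 100), 100)                    # stays in memory
large = fetch_block(io.BytesIO(b"y" * 100), 100, max_in_memory=10)  # spills to disk
```

The point is that the peak per-block allocation is bounded by `chunk_size`, not by the block size, regardless of which sink is chosen.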
[jira] [Commented] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671411#comment-15671411 ] Herman van Hovell commented on SPARK-17450: --- [~cenyuhai] did you have any luck with merging these? Can we close this issue?

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: cen yuhai
>
> spark sql will be OOM when using row_number() over too many sorted records...
> There will be only 1 task to handle all records.
> This sql groups by passenger_id; we have 100 million passengers.
> {code}
> SELECT
>   passenger_id,
>   total_order,
>   (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 670800 THEN 'V3' END) AS order_rank
> FROM
> (
>   SELECT
>     passenger_id,
>     1 as total_order
>   FROM table
>   GROUP BY passenger_id
> ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project 
[passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
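The `TungstenExchange SinglePartition` in the physical plan above is the root cause: `row_number()` over an `ORDER BY` with no `PARTITION BY` shuffles every record into one task. A standard distributed alternative is to range-partition on the sort key, sort each partition locally, and add prefix-sum offsets of the partition sizes to obtain global row numbers without a single giant task. A dependency-free sketch (plain Python, not Spark code):

```python
def global_row_numbers(partitions):
    """partitions: list of lists, already range-partitioned descending
    (every key in partition i >= every key in partition i+1).
    Returns per-partition lists of (global_row_number, key)."""
    ranked, offset = [], 0
    for part in partitions:
        part = sorted(part, reverse=True)   # local sort only, per partition
        ranked.append([(offset + i + 1, v) for i, v in enumerate(part)])
        offset += len(part)                 # prefix sum of partition sizes
    return ranked

parts = [[9, 7, 8], [5, 6, 4], [1, 3, 2]]   # range-partitioned, unsorted within
result = [r for part in global_row_numbers(parts) for r in part]
print(result[:3])  # [(1, 9), (2, 8), (3, 7)]
```

Each partition only needs its own data plus one integer offset, so memory per task stays bounded instead of holding all sorted records at once.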
[jira] [Closed] (SPARK-17662) Dedup UDAF
[ https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-17662. - Resolution: Not A Problem

> Dedup UDAF
> --
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
> Issue Type: New Feature
> Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in a creation order.
> For example, we have an event log of user actions. A user marks his favorite category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in sql:
> select * from (
>   select *, row_number() over (partition by user_id, action_type order by date desc) as rnum from tbl)
> where rnum=1;
> but then, I believe it can't be optimized on the map side and we'll get all the data shuffled to the reducers instead of partially aggregated on the map side.
> We have written a UDAF for this, but then we have other issues - like blocking push-down-predicate for columns.
> Do you have any idea for a proper solution?

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17662) Dedup UDAF
[ https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671401#comment-15671401 ] Herman van Hovell commented on SPARK-17662: --- This is more of a question for the user list or stack overflow. So I am closing this. BTW: I would use max, for example:

{noformat}
select user_id, action_type, max(struct(date, *)) last_record
from tbl
group by 1, 2
{noformat}

> Dedup UDAF
> --
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
> Issue Type: New Feature
> Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in a creation order.
> For example, we have an event log of user actions. A user marks his favorite category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in sql:
> select * from (
>   select *, row_number() over (partition by user_id, action_type order by date desc) as rnum from tbl)
> where rnum=1;
> but then, I believe it can't be optimized on the map side and we'll get all the data shuffled to the reducers instead of partially aggregated on the map side.
> We have written a UDAF for this, but then we have other issues - like blocking push-down-predicate for columns.
> Do you have any idea for a proper solution?

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
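The max(struct(...)) suggestion works because max is an associative aggregate, so it can be partially applied map-side, unlike the row_number() approach which shuffles every row; structs (like Python tuples) compare lexicographically, with the date first. A plain-Python emulation of the semantics (illustrative, not Spark code):

```python
# One pass over the records keeps only the max (date, value) per group;
# because the update is associative, partial results from different map
# tasks can be merged the same way before the shuffle.
records = [
    ("123", "fav category", "1", "2016-02-01"),
    ("123", "fav category", "4", "2016-02-02"),
    ("123", "fav category", "8", "2016-02-03"),
    ("123", "fav category", "2", "2016-02-04"),
]

last_record = {}
for user_id, action_type, value, date in records:
    key = (user_id, action_type)
    candidate = (date, value)          # struct(date, *): date compared first
    if key not in last_record or candidate > last_record[key]:
        last_record[key] = candidate

print(last_record)  # {('123', 'fav category'): ('2016-02-04', '2')}
```

Note the lexicographic comparison only orders correctly here because the dates are zero-padded ISO strings.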
[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation
[ https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671361#comment-15671361 ] Herman van Hovell commented on SPARK-18172: --- This is different from SPARK-18300, this is not a case where foldable propagation is an issue. The thing is that we fixed a lot of these issues recently, so I am not sure which one fixed this one. I am closing this one, please re-open if you hit this again.

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Emlyn Corrin
> Fix For: 2.0.2
>
> Since Spark 2.0.1, the following pyspark snippet fails with {{AnalysisException: The second argument of First should be a boolean literal}} (but it's not restricted to Python, similar code in Java fails in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is: > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 > ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2, > ds._3)).show() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py > in show(self, n, truncate) > 285 +---+-+ > 286 """ > --> 287 print(self._jdf.showString(n, truncate)) > 288 > 289 def __repr__(self): > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o76.showString. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: first(_2#1L)() > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at >
[jira] [Resolved] (SPARK-18172) AnalysisException in first/last during aggregation
[ https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18172. --- Resolution: Fixed Fix Version/s: 2.0.2 Target Version/s: (was: 2.1.0)

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Emlyn Corrin
> Fix For: 2.0.2
>
> Since Spark 2.0.1, the following pyspark snippet fails with {{AnalysisException: The second argument of First should be a boolean literal}} (but it's not restricted to Python, similar code in Java fails in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is: > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 > ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2, > ds._3)).show() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py > in show(self, n, truncate) > 285 +---+-+ > 286 """ > --> 287 print(self._jdf.showString(n, truncate)) > 288 > 289 def __repr__(self): > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o76.showString. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: first(_2#1L)() > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at >
[jira] [Updated] (SPARK-17786) [SPARK 2.0] Sorting algorithm gives higher skewness of output
[ https://issues.apache.org/jira/browse/SPARK-17786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17786: -- Target Version/s: 2.1.0

> [SPARK 2.0] Sorting algorithm gives higher skewness of output
> -
>
> Key: SPARK-17786
> URL: https://issues.apache.org/jira/browse/SPARK-17786
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Maciej Bryński
>
> Hi,
> I'm using df.sort("column") to sort my data before saving it to parquet.
> When using Spark 1.6.2 all partitions were similar in size.
> On Spark 2.0.0 three of the partitions are much bigger than the rest.
> Can I go back to the previous sorting behaviour?

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17788: -- Target Version/s: 2.1.0 > RangePartitioner results in few very large tasks and many small to empty > tasks > --- > > Key: SPARK-17788 > URL: https://issues.apache.org/jira/browse/SPARK-17788 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 > Environment: Ubuntu 14.04 64bit > Java 1.8.0_101 >Reporter: Babak Alipour > > Greetings everyone, > I was trying to read a single field of a Hive table stored as Parquet in > Spark (~140GB for the entire table, this single field is a Double, ~1.4B > records) and look at the sorted output using the following: > sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") > But this simple line of code gives: > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > Same error for: > sql("SELECT " + field + " FROM MY_TABLE).sort(field) > and: > sql("SELECT " + field + " FROM MY_TABLE).orderBy(field) > After doing some searching, the issue seems to lie in the RangePartitioner > trying to create equal ranges. [1] > [1] > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html > > The Double values I'm trying to sort are mostly in the range [0,1] (~70% of > the data which roughly equates 1 billion records), other numbers in the > dataset are as high as 2000. With the RangePartitioner trying to create equal > ranges, some tasks are becoming almost empty while others are extremely > large, due to the heavily skewed distribution. > This is either a bug in Apache Spark or a major limitation of the framework. > I hope one of the devs can help solve this issue. > P.S. 
Email thread on Spark user mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
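The skew the reporter describes can be reproduced without Spark: if range boundaries end up spread evenly over the key *domain* while ~70% of the keys sit in a narrow sub-range, most rows land in one partition. A plain-Python illustration (the equal-width boundary choice mirrors the reporter's hypothesis; Spark's actual RangePartitioner samples the data to pick boundaries):

```python
import bisect
import random

random.seed(0)
# ~70% of the keys in [0, 1], the rest spread up to 2000 (as in the report)
keys = [random.uniform(0, 1) for _ in range(7000)] + \
       [random.uniform(1, 2000) for _ in range(3000)]

num_partitions = 10
lo, hi = min(keys), max(keys)
# equal-width range boundaries over the key domain
bounds = [lo + (hi - lo) * i / num_partitions for i in range(1, num_partitions)]

sizes = [0] * num_partitions
for k in keys:
    sizes[bisect.bisect_right(bounds, k)] += 1

print(sizes)  # the first partition holds well over 70% of all rows
```

With such a distribution, the first range swallows the entire [0, 1] mass plus a slice of the tail, while the remaining nine partitions split only the sparse tail.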
[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671331#comment-15671331 ] Herman van Hovell commented on SPARK-17932: --- This is currently not implemented in Spark 2.0. Feel free to open up a PR for this.

> Failed to run SQL "show table extended like table_name" in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: pin_zhang
>
> SQL "show table extended like table_name" doesn't work in Spark 2.0.0, although it works in Spark 1.5.2.
> Error: org.apache.spark.sql.catalyst.parser.ParseException:
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended like test
> ---^^^ (state=,code=0)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17897: -- Target Version/s: 2.1.0 > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
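The always-false condition in the bug above is easy to check directly: `NOT isnotnull(x)` is just `isnull(x)`, so conjoining the inferred `isnotnull(x)` constraint yields `isnull(x) AND isnotnull(x)`, which no input can satisfy. A plain-Python demonstration (illustrative, restricted to definite true/false values):

```python
def isnotnull(x):
    return x is not None

original  = lambda x: not isnotnull(x)                    # Filter (NOT isnotnull(x))
optimized = lambda x: isnotnull(x) and not isnotnull(x)   # after the rewrite

for x in [None, 0, "a"]:
    print(x, original(x), optimized(x))
# the original filter keeps the NULL row; the optimized one keeps nothing
```

So the rewrite silently turns a "keep only NULLs" filter into "keep nothing", which is why it is a correctness bug rather than a mere missed optimization.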
[jira] [Commented] (SPARK-18460) Include triggerDetails in StreamingQueryStatus.json
[ https://issues.apache.org/jira/browse/SPARK-18460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671328#comment-15671328 ] Apache Spark commented on SPARK-18460: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15908 > Include triggerDetails in StreamingQueryStatus.json > --- > > Key: SPARK-18460 > URL: https://issues.apache.org/jira/browse/SPARK-18460 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18459) Rename triggerId to batchId in StreamingQueryStatus.triggerDetails
[ https://issues.apache.org/jira/browse/SPARK-18459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671327#comment-15671327 ] Apache Spark commented on SPARK-18459: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15908 > Rename triggerId to batchId in StreamingQueryStatus.triggerDetails > -- > > Key: SPARK-18459 > URL: https://issues.apache.org/jira/browse/SPARK-18459 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > triggerId seems like a number that should increase with each trigger, whether or not there is data in it. In practice, however, triggerId increases only when there is a batch of data in a trigger, so it's better to rename it to batchId. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class
[ https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671316#comment-15671316 ] Herman van Hovell edited comment on SPARK-17977 at 11/16/16 7:07 PM: - [~aassudani] do you want to open a PR for this? was (Author: hvanhovell): [~aassudani] want to open a PR for this? > DataFrameReader and DataStreamReader should have an ancestor class > -- > > Key: SPARK-17977 > URL: https://issues.apache.org/jira/browse/SPARK-17977 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Assudani > > There should be an ancestor class of DataFrameReader and DataStreamReader to configure common options / format and use common methods. Most of the methods are exactly the same and take exactly the same arguments. This would help in creating utilities and generic code shared between stream and batch applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class
[ https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671316#comment-15671316 ] Herman van Hovell commented on SPARK-17977: --- [~aassudani] want to open a PR for this? > DataFrameReader and DataStreamReader should have an ancestor class > -- > > Key: SPARK-17977 > URL: https://issues.apache.org/jira/browse/SPARK-17977 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Assudani > > There should be an ancestor class of DataFrameReader and DataStreamReader to configure common options / format and use common methods. Most of the methods are exactly the same and take exactly the same arguments. This would help in creating utilities and generic code shared between stream and batch applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671314#comment-15671314 ] Herman van Hovell commented on SPARK-18458: --- Nice find! > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Priority: Critical > Labels: core, dump > > Running a query on a 100TB parquet database using the Spark master dated 11/04 dumps cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one that can produce this core dump, just the easiest one with which to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. 
> Container id: container_e68_1478924651089_0018_01_74 > Exit code: 134 > Exception message: /bin/bash: line 1: 4031216 Aborted (core > dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java > -server -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 > Aborted (core dumped) > /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server > -Xmx24576m > 
-Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) > at
[jira] [Resolved] (SPARK-18461) Improve docs on StreamingQueryListener and StreamingQuery.status
[ https://issues.apache.org/jira/browse/SPARK-18461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18461. -- Resolution: Fixed Issue resolved by pull request 15897 [https://github.com/apache/spark/pull/15897] > Improve docs on StreamingQueryListener and StreamingQuery.status > > > Key: SPARK-18461 > URL: https://issues.apache.org/jira/browse/SPARK-18461 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class
[ https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671287#comment-15671287 ] Michael Armbrust commented on SPARK-17977: -- No, they were actually the same class for a while. I would be fine with creating a trait that has the common methods. > DataFrameReader and DataStreamReader should have an ancestor class > -- > > Key: SPARK-17977 > URL: https://issues.apache.org/jira/browse/SPARK-17977 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Assudani > > There should be an ancestor class of DataFrameReader and DataStreamReader to configure common options / format and use common methods. Most of the methods are exactly the same and take exactly the same arguments. This would help in creating utilities and generic code shared between stream and batch applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
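A minimal sketch of what such a shared ancestor could look like. All names here are hypothetical and illustrative, not Spark's actual API; the self-typed generic parameter lets each concrete reader keep its own type through the fluent builder chain while sharing the common format/option methods:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class holding the builder methods that DataFrameReader
// and DataStreamReader currently duplicate. T is the concrete reader type,
// so format()/option() calls return the subtype and chaining keeps working.
abstract class BaseReader<T extends BaseReader<T>> {
    String source = "parquet";  // illustrative default
    final Map<String, String> extraOptions = new HashMap<>();

    @SuppressWarnings("unchecked")
    T self() { return (T) this; }

    T format(String source) { this.source = source; return self(); }

    T option(String key, String value) {
        extraOptions.put(key, value);
        return self();
    }
}

// The concrete readers would then only add their batch- or stream-specific
// entry points (load(), etc.); sketched here as empty subclasses.
class DataFrameReaderSketch extends BaseReader<DataFrameReaderSketch> {}
class DataStreamReaderSketch extends BaseReader<DataStreamReaderSketch> {}

public class ReaderSketchDemo {
    public static void main(String[] args) {
        DataFrameReaderSketch r =
            new DataFrameReaderSketch().format("csv").option("header", "true");
        System.out.println(r.source + " " + r.extraOptions);
    }
}
```

Utilities written against `BaseReader<?>` could then configure either reader, which is the "generic code being used for stream / batch applications" the reporter asks for.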
[jira] [Assigned] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18458: Assignee: (was: Apache Spark) > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Priority: Critical > Labels: core, dump > > Running a query on 100TB parquet database using the Spark master dated 11/04 > dump cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one can > produce this core dump, just the easiest one to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. 
[jira] [Assigned] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18458: Assignee: Apache Spark > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Assignee: Apache Spark >Priority: Critical > Labels: core, dump > > Running a query on 100TB parquet database using the Spark master dated 11/04 > dump cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one can > produce this core dump, just the easiest one to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. 
[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671237#comment-15671237 ] Kazuaki Ishizaki commented on SPARK-18458: -- I worked with [~jfc...@us.ibm.com]. Then, I identified that a root cause is an integer overflow in a Java expression. When SIGSEGV occurred, we identified {{dest}} is less than 0 at [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254]. It should be greater than 0. It means that {{offsets\[bucket\]}} is less than 0. Next, when we checked [transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168], we knew that {{outputOffset}} is less than 0. Then, when we checked [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245], we knew that {{outindex = 0x10\?\?\?\?\?\?}}. Since the result of multiplication is {{0x80\?\?\?\?\?\?}}, it means that less than 0 as int. This is because {{outIndex}} and {{8}} are integer type. Finally, we understood integer overflow occurs. Here, the expression must be {{array.getBaseOffset() + outIndex * 8L}}. > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Priority: Critical > Labels: core, dump > > Running a query on 100TB parquet database using the Spark master dated 11/04 > dump cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one can > produce this core dump, just the easiest one to re-create the error). 
[jira] [Issue Comment Deleted] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-18458: - Comment: was deleted (was: I worked with [~jfc...@us.ibm.com]. Then, I identified that a root cause is an integer overflow in a Java expression. When SIGSEGV occurred, we identified {{dest}} is less than 0 at [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254]. It should be greater than 0. It means that {{offsets\[bucket\]}} is less than 0. When we checked [transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168], we knew that {{outputOffset}} is less than 0. When we checked [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245], we knew that {{outindex = 0x10__}}. Since {{outIndex}} and {{8}} are integer, the result of multiplication is {{0x80??}}. It means that less than 0 as int. Here, the expression must be {{array.getBaseOffset() + outIndex * 8L}}.) > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Priority: Critical > Labels: core, dump > > Running a query on 100TB parquet database using the Spark master dated 11/04 > dump cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one can > produce this core dump, just the easiest one to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. 
[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671229#comment-15671229 ] Kazuaki Ishizaki commented on SPARK-18458: -- I worked with [~jfc...@us.ibm.com]. We identified that the root cause is an integer overflow in a Java expression. When the SIGSEGV occurred, we saw that {{dest}} was less than 0 at [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254], although it should be greater than 0; this means {{offsets\[bucket\]}} is less than 0. When we checked [transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168], we saw that {{outputOffset}} was less than 0. When we checked [RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245], we saw that {{outindex = 0x10__}}. Since {{outIndex}} and {{8}} are both of type int, the result of the multiplication is {{0x80??}}, which is negative as an int. The expression must therefore be {{array.getBaseOffset() + outIndex * 8L}}. > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Priority: Critical > Labels: core, dump > > Running a query on a 100TB parquet database using the Spark master dated 11/04 dumps cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one that can produce this core dump, just the easiest one with which to re-create the error). 
> Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_e68_1478924651089_0018_01_74 > Exit code: 134 > Exception message: /bin/bash: line 1: 4031216 Aborted (core > dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java > -server -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 > Aborted (core dumped) > /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server > -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path >
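The overflow described in the comment above can be reproduced in isolation. The values below are illustrative ({{baseOffset}} and the concrete {{outIndex}} are hypothetical stand-ins, since the actual value in the report is elided), but the overflow mechanics are the same:

```java
public class RadixOffsetOverflow {
    public static void main(String[] args) {
        // Illustrative values: baseOffset stands in for array.getBaseOffset(),
        // and outIndex for a large record index (~268 million records).
        long baseOffset = 16L;
        int outIndex = 0x10_000_000; // hypothetical; the real value in the report is elided

        // Buggy form: outIndex * 8 is evaluated in 32-bit int arithmetic first.
        // 0x10_000_000 * 8 == 0x80_000_000, which wraps to Integer.MIN_VALUE
        // before being widened to long for the addition.
        long buggy = baseOffset + outIndex * 8;

        // Fixed form: the long literal 8L promotes the multiplication to 64 bits,
        // so the offset stays positive.
        long fixed = baseOffset + outIndex * 8L;

        System.out.println(buggy); // prints -2147483632 (negative offset, hence the SIGSEGV downstream)
        System.out.println(fixed); // prints 2147483664
    }
}
```

This is why appending {{L}} to the multiplier is sufficient: Java only widens operands to {{long}} when at least one operand is already {{long}}, so the wrap-around must be prevented inside the multiplication itself, not at the outer addition.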
[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation
[ https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671227#comment-15671227 ] Emlyn Corrin commented on SPARK-18172: -- It occurs on 2.0.1 and 2.0.2 (on Mac, installed via Homebrew), but I'd need more time to compile a more recent version and test that. Since it was literally just the snippet above that triggered it, I think it's probably fixed if you can't reproduce it. Maybe it is the same underlying problem as SPARK-18300, and the fix for that fixed this too? > AnalysisException in first/last during aggregation > -- > > Key: SPARK-18172 > URL: https://issues.apache.org/jira/browse/SPARK-18172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Emlyn Corrin > > Since Spark 2.0.1, the following pyspark snippet fails with > {{AnalysisException: The second argument of First should be a boolean > literal}} (but it's not restricted to Python; similar code in Java fails > in the same way). > It worked in Spark 2.0.0, so I believe it may be related to the fix for > SPARK-16648. > {code} > from pyspark.sql import functions as F > ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]])) > ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), > F.countDistinct(ds._2, ds._3)).show() > {code} > It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is: > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 > ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2, > ds._3)).show() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py > in show(self, n, truncate) > 285 +---+-+ > 286 """ > --> 287 print(self._jdf.showString(n, truncate)) > 288 > 289 def __repr__(self): > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o76.showString. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: first(_2#1L)() > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at >