[jira] [Resolved] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-22018.
---
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 19240
[https://github.com/apache/spark/pull/19240]

> Catalyst Optimizer does not preserve top-level metadata while collapsing
> projects
> -
>
> Key: SPARK-22018
> URL: https://issues.apache.org/jira/browse/SPARK-22018
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, Structured Streaming
> Affects Versions: 2.1.1, 2.2.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Fix For: 3.0.0
>
> Consider two projects like the following:
> {code}
> Project [a_with_metadata#27 AS b#26]
> +- Project [a#0 AS a_with_metadata#27]
>    +- LocalRelation , [a#0, b#1]
> {code}
> The child Project has an output column with metadata attached, and the parent
> Project has an alias that implicitly forwards that metadata, so the metadata
> is visible to higher operators. After the CollapseProject optimizer rule is
> applied, the metadata is no longer preserved:
> {code}
> Project [a#0 AS b#26]
> +- LocalRelation , [a#0, b#1]
> {code}
> This is incorrect: downstream operators that rely on metadata to identify
> certain fields (e.g. the watermark column in Structured Streaming) will fail
> to do so.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
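The pitfall described above can be illustrated without Spark. The following is a minimal, hypothetical model (plain Java maps, not Spark's actual `Project`/`Attribute` classes): an inner projection attaches metadata to a column, and an outer projection merely renames it, so a correct collapse must carry the inner column's metadata over to the outer alias.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the CollapseProject pitfall: collapsing an outer
// rename over an inner aliased column must propagate the inner column's
// metadata, otherwise higher operators can no longer see it.
public class CollapseMetadata {
    /** Collapse outer(alias -> inner column name) over inner(column name -> metadata),
     *  carrying the referenced inner column's metadata to the outer alias. */
    static Map<String, String> collapse(Map<String, String> outerAliases,
                                        Map<String, String> innerMetadata) {
        Map<String, String> collapsed = new HashMap<>();
        for (Map.Entry<String, String> e : outerAliases.entrySet()) {
            // A naive collapse that rebuilds the alias from scratch would drop this.
            collapsed.put(e.getKey(), innerMetadata.getOrDefault(e.getValue(), ""));
        }
        return collapsed;
    }

    public static void main(String[] args) {
        Map<String, String> inner = new HashMap<>();
        inner.put("a_with_metadata", "watermark");   // inner Project attaches metadata
        Map<String, String> outer = new HashMap<>();
        outer.put("b", "a_with_metadata");           // outer Project merely renames
        // After the collapse, the alias "b" still carries the metadata.
        System.out.println(collapse(outer, inner).get("b"));
    }
}
```

In Spark itself the fix landed in the CollapseProject rule via the pull request referenced above; this sketch only models why the propagation step is needed.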
[jira] [Created] (SPARK-22019) JavaBean int type property
taiho choi created SPARK-22019:
--
Summary: JavaBean int type property
Key: SPARK-22019
URL: https://issues.apache.org/jira/browse/SPARK-22019
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: taiho choi

When the type of SampleData's id field is int, the following code generates errors; when it is long, it works fine.
{code:java}
@Test
public void testDataSet2() {
    ArrayList<String> arr = new ArrayList<>();
    arr.add("{\"str\": \"everyone\", \"id\": 1}");
    arr.add("{\"str\": \"Hello\", \"id\": 1}");

    // 1. read the array and convert it to a string dataset
    JavaRDD<String> data = sc.parallelize(arr);
    Dataset<String> stringdataset = sqc.createDataset(data.rdd(), Encoders.STRING());
    stringdataset.show(); // PASS

    // 2. convert the string dataset to a SampleData dataset
    Dataset<SampleData> df = sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
    df.show();        // PASS
    df.printSchema(); // PASS

    Dataset<SampleDataFlat> fad = df.flatMap(SampleDataFlat::flatMap, Encoders.bean(SampleDataFlat.class));
    fad.show();        // ERROR
    fad.printSchema();
}

public static class SampleData implements Serializable {
    String str;
    int id;

    public String getStr() { return str; }
    public void setStr(String str) { this.str = str; }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}

public static class SampleDataFlat {
    String str;

    public String getStr() { return str; }
    public void setStr(String str) { this.str = str; }

    public SampleDataFlat(String str, long id) {
        this.str = str;
    }

    public static Iterator<SampleDataFlat> flatMap(SampleData data) {
        ArrayList<SampleDataFlat> arr = new ArrayList<>();
        arr.add(new SampleDataFlat(data.getStr(), data.getId()));
        arr.add(new SampleDataFlat(data.getStr(), data.getId() + 1));
        arr.add(new SampleDataFlat(data.getStr(), data.getId() + 2));
        return arr.iterator();
    }
}
{code}

== Error message ==
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 38, Column 16: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 38, Column 16: No applicable constructor/method found for actual parameters "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
{code:java}
/* 024 */   public java.lang.Object apply(java.lang.Object _i) {
/* 025 */     InternalRow i = (InternalRow) _i;
/* 026 */
/* 027 */     final SparkUnitTest$SampleData value1 = false ? null : new SparkUnitTest$SampleData();
/* 028 */     this.javaBean = value1;
/* 029 */     if (!false) {
/* 030 */
/* 031 */
/* 032 */       boolean isNull3 = i.isNullAt(0);
/* 033 */       long value3 = isNull3 ? -1L : (i.getLong(0));
/* 034 */
/* 035 */       if (isNull3) {
/* 036 */         throw new NullPointerException(((java.lang.String) references[0]));
/* 037 */       }
/* 038 */       javaBean.setId(value3);
{code}
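The failure above comes down to a plain Java rule: JSON numbers are read back as long, but a long argument cannot be passed to a method taking int without an explicit narrowing cast. A minimal sketch, independent of Spark (the bean below mirrors the reporter's SampleData but is otherwise hypothetical):

```java
// Why the generated call javaBean.setId(value3) fails when value3 is long:
// Java has no setId(long) to resolve to, and long does not narrow implicitly.
public class SetterMismatch {
    static class SampleData {
        int id;
        public void setId(int id) { this.id = id; }
        public int getId() { return id; }
    }

    public static void main(String[] args) {
        long inferred = 1L; // JSON integers are inferred as long

        // Reflection mirrors the codegen lookup: there is no setId(long).
        try {
            SampleData.class.getMethod("setId", long.class);
        } catch (NoSuchMethodException e) {
            System.out.println("no setId(long) on SampleData");
        }

        // Calling setId(int) with a long requires an explicit narrowing cast.
        SampleData d = new SampleData();
        d.setId((int) inferred);
        System.out.println(d.getId());
    }
}
```

This is why declaring the bean property as long (matching the inferred schema type) makes the reported test pass.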
[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167280#comment-16167280 ]

Jia-Xuan Liu edited comment on SPARK-21994 at 9/15/17 3:40 AM:
---

I also can't reproduce this in the Spark 2.2 release. I think this may not be a problem in Spark itself.
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446512312).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")

scala> spark.sql("select * from test.spark22_test_2").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}

was (Author: goldmedal):
I also can't reproduce this in the Spark 2.2 release.
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446512312).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")

scala> spark.sql("select * from test.spark22_test_2").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
> Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not
> write the 'path' of the table to the Hive metastore, unlike in previous
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It
> just outputs the table header without any row content.
> A parallel installation of Spark 1.6 at least produces an appropriate error
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found
> in metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException:
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> +------------+
> |databaseName|
> +------------+
> +------------+
> {code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path",
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> +------------+
> |databaseName|
> +------------+
> |       mydb1|
> |       mydb2|
> |     default|
> |        test|
> +------------+
> {code}
> It is kind of a disaster that we are not able to read tables created by the
> very same Spark version and have to manually specify the path as an explicit
> option.
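The Spark 1.6 error trace quoted in the issue pinpoints the failure: a lookup of the "path" key in the table's metastore properties. The following is an illustrative model only (plain Java maps, not Spark's catalog code) of why the missing entry breaks reads and why the `.option("path", ...)` workaround restores them; the property names are assumptions for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Model of the failure: table properties written to the metastore lack the
// "path" entry, so resolving the table's storage location throws
// NoSuchElementException: key not found: path.
public class MissingPathKey {
    static String location(Map<String, String> tableProperties) {
        String path = tableProperties.get("path");
        if (path == null)
            throw new NoSuchElementException("key not found: path");
        return path;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("serialization.format", "1"); // "path" was never written
        try {
            location(props);
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage());
        }

        // The workaround: writing with an explicit path populates the entry.
        props.put("path", "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path");
        System.out.println(location(props));
    }
}
```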
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167280#comment-16167280 ]

Jia-Xuan Liu commented on SPARK-21994:
--

I also can't reproduce this in the Spark 2.2 release.
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446512312).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")

scala> spark.sql("select * from test.spark22_test_2").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}
[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276 ]

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:31 AM:
---

Looks like it's not a problem in the latest master build (commit a28728a, version 2.3.0-SNAPSHOT):
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446109004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}

was (Author: xiayunsun):
I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT)
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446109004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}
[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276 ]

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:30 AM:
---

I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT):
{code:java}
Spark context available as 'sc' (master = local[*], app id = local-1505446109004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}

was (Author: xiayunsun):
I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT)
{code:java}
scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}
[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276 ]

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:29 AM:
---

I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT):
{code:java}
scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
{code}

was (Author: xiayunsun):
I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT)
{{
scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
}}
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276 ]

Xiayun Sun commented on SPARK-21994:
--

I'm unable to reproduce this for the latest master build (commit a28728a, version 2.3.0-SNAPSHOT):
{{
scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+

scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
+------------+
|databaseName|
+------------+
|     default|
|        test|
+------------+
}}
[jira] [Commented] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns
[ https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167258#comment-16167258 ]

Apache Spark commented on SPARK-19459:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19235

> ORC tables cannot be read when they contain char/varchar columns
>
> Key: SPARK-19459
> URL: https://issues.apache.org/jira/browse/SPARK-19459
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Herman van Hovell
> Assignee: Herman van Hovell
> Fix For: 2.1.1, 2.2.0
>
> Reading from an ORC table which contains char/varchar columns can fail if the
> table has been created using Spark. This is caused by the fact that Spark
> internally replaces char and varchar columns with a string column; this makes
> the ORC reader pick the wrong column reader, which eventually causes a
> ClassCastException.
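The failure mode described above (a reader chosen for the declared type receiving a value of the file's actual type) can be modeled in a few lines. This is a hypothetical illustration only, not the ORC reader's actual classes: the stand-in value types are assumptions.

```java
// Model of the char/varchar mismatch: the table schema says string, so a
// string-oriented reader is selected, but the file hands back a value of a
// different runtime type, and the cast inside the reader fails.
public class WrongReaderModel {
    public static void main(String[] args) {
        Object fromFile = new StringBuilder("abc"); // stand-in for a char-typed ORC value
        try {
            String s = (String) fromFile; // the string reader expects String
            System.out.println(s);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: wrong reader for char/varchar column");
        }
    }
}
```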
[jira] [Commented] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
[ https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167257#comment-16167257 ]

Apache Spark commented on SPARK-14387:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19235

> Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> -
>
> Key: SPARK-14387
> URL: https://issues.apache.org/jira/browse/SPARK-14387
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Rajesh Balamohan
>
> In the master branch, I tried to run TPC-DS queries (e.g. Query27) at 200 GB
> scale. Initially I got the following exception (as FileScanRDD has been made
> the default in the master branch):
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0.
> java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
> When running with "spark.sql.sources.fileScan=false", the following exception
> is thrown:
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing
> query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
> at org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
>
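Both stack traces above bottom out in a StructType.fieldIndex lookup for a column name that is absent from the schema read out of the ORC file. The following is an illustrative model only (not Spark's StructType); the positional `_colN` field names reflect how Hive 1.x wrote ORC schemas, which is an assumption of this sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Model of the fieldIndex failure: the ORC file footer carries positional
// column names, so looking up the query's real column name finds nothing.
public class FieldIndexModel {
    private final Map<String, Integer> nameToIndex = new LinkedHashMap<>();

    FieldIndexModel(String... fieldNames) {
        for (int i = 0; i < fieldNames.length; i++) nameToIndex.put(fieldNames[i], i);
    }

    int fieldIndex(String name) {
        Integer idx = nameToIndex.get(name);
        if (idx == null)
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        return idx;
    }

    public static void main(String[] args) {
        // Schema as read from a Hive-1.x ORC file: positional names only.
        FieldIndexModel schema = new FieldIndexModel("_col0", "_col1");
        try {
            schema.fieldIndex("s_store_sk"); // the query asks by real column name
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```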
[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared
[ https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167248#comment-16167248 ] taiho choi commented on SPARK-22000: @members, my code generates this error message, but I have not been able to write sample code that reproduces the issue. > org.codehaus.commons.compiler.CompileException: toString method is not > declared > --- > > Key: SPARK-22000 > URL: https://issues.apache.org/jira/browse/SPARK-22000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > the error message says that toString is not declared on "value13", which is > of "long" type in the generated code. > I think value13 should be of Long type. > ==error message > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 70, Column 32: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 70, Column 32: A method named "toString" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 033 */ private void apply1_2(InternalRow i) { > /* 034 */ > /* 035 */ > /* 036 */ boolean isNull11 = i.isNullAt(1); > /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1)); > /* 038 */ boolean isNull10 = true; > /* 039 */ java.lang.String value10 = null; > /* 040 */ if (!isNull11) { > /* 041 */ > /* 042 */ isNull10 = false; > /* 043 */ if (!isNull10) { > /* 044 */ > /* 045 */ Object funcResult4 = null; > /* 046 */ funcResult4 = value11.toString(); > /* 047 */ > /* 048 */ if (funcResult4 != null) { > /* 049 */ value10 = (java.lang.String) funcResult4; > /* 050 */ } else { > /* 051 */ isNull10 = true; > /* 052 */ } > /* 053 */ > /* 054 */ > /* 055 */ } > /* 056 */ } > /* 057 */ javaBean.setApp(value10); > /* 058 */ > /* 059 */ > /* 060 */ boolean isNull13 = i.isNullAt(12); > /* 061 */ long value13 = isNull13 ?
-1L : (i.getLong(12)); > /* 062 */ boolean isNull12 = true; > /* 063 */ java.lang.String value12 = null; > /* 064 */ if (!isNull13) { > /* 065 */ > /* 066 */ isNull12 = false; > /* 067 */ if (!isNull12) { > /* 068 */ > /* 069 */ Object funcResult5 = null; > /* 070 */ funcResult5 = value13.toString(); > /* 071 */ > /* 072 */ if (funcResult5 != null) { > /* 073 */ value12 = (java.lang.String) funcResult5; > /* 074 */ } else { > /* 075 */ isNull12 = true; > /* 076 */ } > /* 077 */ > /* 078 */ > /* 079 */ } > /* 080 */ } > /* 081 */ javaBean.setReasonCode(value12); > /* 082 */ > /* 083 */ } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
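The compile error in the generated code above follows directly from Java's type rules: `value13` is a primitive `long`, and primitives have no methods, so `funcResult5 = value13.toString();` cannot compile. A minimal sketch of the intended conversion and the standard fix (`String.valueOf`); the helper name is my own, not part of Spark's codegen:

```java
public class PrimitiveToStringDemo {
    // Hypothetical stand-in for the generated snippet: convert a
    // possibly-null long column (value13/isNull13) to a String.
    // `value13.toString()` does not compile on a primitive long;
    // String.valueOf(long) (or Long.toString(long)) is the usual conversion.
    static String longColumnToString(long value13, boolean isNull13) {
        return isNull13 ? null : String.valueOf(value13);
    }

    public static void main(String[] args) {
        System.out.println(longColumnToString(42L, false)); // prints 42
        System.out.println(longColumnToString(-1L, true));  // prints null
    }
}
```

Boxing the value to `java.lang.Long` (as the reporter suggests) would also make the `.toString()` call legal, at the cost of an allocation per row.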
[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21997: -- Description: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. {code} scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ {code} was: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on it `true`. 
{code} scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ {code} > Spark shows different results on char/varchar columns on Parquet > > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > {code} > scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") > scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
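The difference between the two result tables above is CHAR(10) padding: with `convertMetastoreParquet=false` the Hive reader path returns `'a'` right-padded with spaces to the declared width, while the converted path returns the raw value. A toy sketch of the padding rule (my own helper, not Spark or Hive code):

```java
public class CharPaddingDemo {
    // CHAR(n) semantics: values are right-padded with spaces to length n.
    // This models why one code path shows "a " (padded) and the other "a".
    static String padChar(String value, int n) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < n) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the trailing padding visible; width is 10.
        System.out.println("[" + padChar("a", 10) + "]");
    }
}
```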
[jira] [Assigned] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22018: Assignee: Apache Spark (was: Tathagata Das) > Catalyst Optimizer does not preserve top-level metadata while collapsing > projects > - > > Key: SPARK-22018 > URL: https://issues.apache.org/jira/browse/SPARK-22018 > Project: Spark > Issue Type: Bug > Components: Optimizer, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Tathagata Das >Assignee: Apache Spark > > If there are two projects like as follows. > {code} > Project [a_with_metadata#27 AS b#26] > +- Project [a#0 AS a_with_metadata#27] >+- LocalRelation , [a#0, b#1] > {code} > Child Project has an output column with a metadata in it, and the parent > Project has an alias that implicitly forwards the metadata. So this metadata > is visible for higher operators. Upon applying CollapseProject optimizer > rule, the metadata is not preserved. > {code} > Project [a#0 AS b#26] > +- LocalRelation , [a#0, b#1] > {code} > This is incorrect, as downstream operators that expect certain metadata (e.g. > watermark in structured streaming) to identify certain fields will fail to do > so. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22018: Assignee: Tathagata Das (was: Apache Spark) > Catalyst Optimizer does not preserve top-level metadata while collapsing > projects > - > > Key: SPARK-22018 > URL: https://issues.apache.org/jira/browse/SPARK-22018 > Project: Spark > Issue Type: Bug > Components: Optimizer, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > If there are two projects like as follows. > {code} > Project [a_with_metadata#27 AS b#26] > +- Project [a#0 AS a_with_metadata#27] >+- LocalRelation , [a#0, b#1] > {code} > Child Project has an output column with a metadata in it, and the parent > Project has an alias that implicitly forwards the metadata. So this metadata > is visible for higher operators. Upon applying CollapseProject optimizer > rule, the metadata is not preserved. > {code} > Project [a#0 AS b#26] > +- LocalRelation , [a#0, b#1] > {code} > This is incorrect, as downstream operators that expect certain metadata (e.g. > watermark in structured streaming) to identify certain fields will fail to do > so. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167166#comment-16167166 ] Apache Spark commented on SPARK-22018: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/19240 > Catalyst Optimizer does not preserve top-level metadata while collapsing > projects > - > > Key: SPARK-22018 > URL: https://issues.apache.org/jira/browse/SPARK-22018 > Project: Spark > Issue Type: Bug > Components: Optimizer, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > If there are two projects like as follows. > {code} > Project [a_with_metadata#27 AS b#26] > +- Project [a#0 AS a_with_metadata#27] >+- LocalRelation , [a#0, b#1] > {code} > Child Project has an output column with a metadata in it, and the parent > Project has an alias that implicitly forwards the metadata. So this metadata > is visible for higher operators. Upon applying CollapseProject optimizer > rule, the metadata is not preserved. > {code} > Project [a#0 AS b#26] > +- LocalRelation , [a#0, b#1] > {code} > This is incorrect, as downstream operators that expect certain metadata (e.g. > watermark in structured streaming) to identify certain fields will fail to do > so. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
Tathagata Das created SPARK-22018: - Summary: Catalyst Optimizer does not preserve top-level metadata while collapsing projects Key: SPARK-22018 URL: https://issues.apache.org/jira/browse/SPARK-22018 Project: Spark Issue Type: Bug Components: Optimizer, Structured Streaming Affects Versions: 2.2.0, 2.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Suppose there are two Projects as follows. {code} Project [a_with_metadata#27 AS b#26] +- Project [a#0 AS a_with_metadata#27] +- LocalRelation , [a#0, b#1] {code} The child Project has an output column with metadata, and the parent Project has an alias that implicitly forwards that metadata, so the metadata is visible to higher operators. Upon applying the CollapseProject optimizer rule, the metadata is not preserved. {code} Project [a#0 AS b#26] +- LocalRelation , [a#0, b#1] {code} This is incorrect: downstream operators that rely on certain metadata (e.g. the watermark in Structured Streaming) to identify fields will fail to do so.
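The invariant the fix must maintain can be shown with a toy model (plain maps standing in for Catalyst attributes; all names hypothetical, not Spark classes): collapsing `Project [a_with_metadata AS b]` over `Project [a AS a_with_metadata]` into `Project [a AS b]` has to carry the inner alias's metadata over to `b`.

```java
import java.util.HashMap;
import java.util.Map;

public class CollapseProjectMetadataDemo {
    // Toy model: output column name -> metadata attached to it.
    // The buggy rewrite dropped the metadata carried by the inner alias;
    // the correct collapse keeps it under the outer alias's name.
    static Map<String, String> collapse(Map<String, String> childOutputMeta,
                                        String innerAlias, String outerAlias) {
        Map<String, String> collapsed = new HashMap<>();
        // Preserve the child alias's metadata under the parent alias's name.
        collapsed.put(outerAlias, childOutputMeta.get(innerAlias));
        return collapsed;
    }

    public static void main(String[] args) {
        Map<String, String> childMeta = new HashMap<>();
        childMeta.put("a_with_metadata", "eventTimeWatermark");
        System.out.println(collapse(childMeta, "a_with_metadata", "b"));
    }
}
```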
[jira] [Commented] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
[ https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167033#comment-16167033 ] Ihor Bobak commented on SPARK-21523: This is also a critical issue for us: we ran into the same problem. > Fix bug of strong wolfe linesearch `init` parameter lose effectiveness > -- > > Key: SPARK-21523 > URL: https://issues.apache.org/jira/browse/SPARK-21523 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > We need to merge this Breeze bugfix into Spark because it influences a series of > algorithms in MLlib which use LBFGS. > https://github.com/scalanlp/breeze/pull/651
[jira] [Commented] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167020#comment-16167020 ] Apache Spark commented on SPARK-22017: -- User 'joseph-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/19239 > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22017: Assignee: (was: Apache Spark) > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22017: Assignee: Apache Spark > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Assignee: Apache Spark > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
Jose Torres created SPARK-22017: --- Summary: watermark evaluation with multi-input stream operators is unspecified Key: SPARK-22017 URL: https://issues.apache.org/jira/browse/SPARK-22017 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jose Torres Watermarks are stored as a single value in StreamExecution. If a query has multiple watermark nodes (which can generally only happen with multi input operators like union), a headOption call will arbitrarily pick one to use as the real one. This will happen independently in each batch, possibly leading to strange and undefined behavior. We should instead choose the minimum from all watermark exec nodes as the query-wide watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
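The proposed behavior in the last sentence is easy to state precisely: the query-wide watermark must be the minimum over all watermark exec nodes, so that no branch of a union has its late data dropped before every branch has advanced past it. A toy sketch (not StreamExecution's actual bookkeeping):

```java
import java.util.Arrays;
import java.util.List;

public class QueryWatermarkDemo {
    // With multiple watermark nodes (e.g. from a union), the query-wide
    // watermark is the minimum across all of them, replacing the arbitrary
    // headOption pick described in the ticket.
    static long queryWatermark(List<Long> nodeWatermarks) {
        return nodeWatermarks.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L); // no watermark nodes: stay at epoch
    }

    public static void main(String[] args) {
        // Two branches of a union at event-time watermarks 5000 ms and 3000 ms.
        System.out.println(queryWatermark(Arrays.asList(5000L, 3000L))); // prints 3000
    }
}
```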
[jira] [Assigned] (SPARK-22016) Add HiveDialect for JDBC connection to Hive
[ https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22016: Assignee: Apache Spark > Add HiveDialect for JDBC connection to Hive > --- > > Key: SPARK-22016 > URL: https://issues.apache.org/jira/browse/SPARK-22016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.2.1 >Reporter: Daniel Fernandez >Assignee: Apache Spark > > I found out there is no Dialect for Hive in spark. So, I would like to add > the HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it. > Only two functions should be overriden: > * canHandle > * quoteIdentifier -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22016) Add HiveDialect for JDBC connection to Hive
[ https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166988#comment-16166988 ] Apache Spark commented on SPARK-22016: -- User 'danielfx90' has created a pull request for this issue: https://github.com/apache/spark/pull/19238 > Add HiveDialect for JDBC connection to Hive > --- > > Key: SPARK-22016 > URL: https://issues.apache.org/jira/browse/SPARK-22016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.2.1 >Reporter: Daniel Fernandez > > I found out there is no Dialect for Hive in spark. So, I would like to add > the HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it. > Only two functions should be overriden: > * canHandle > * quoteIdentifier -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22016) Add HiveDialect for JDBC connection to Hive
[ https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22016: Assignee: (was: Apache Spark) > Add HiveDialect for JDBC connection to Hive > --- > > Key: SPARK-22016 > URL: https://issues.apache.org/jira/browse/SPARK-22016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.2.1 >Reporter: Daniel Fernandez > > I found out there is no Dialect for Hive in spark. So, I would like to add > the HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it. > Only two functions should be overriden: > * canHandle > * quoteIdentifier -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22016) Add HiveDialect for JDBC connection to Hive
Daniel Fernandez created SPARK-22016: Summary: Add HiveDialect for JDBC connection to Hive Key: SPARK-22016 URL: https://issues.apache.org/jira/browse/SPARK-22016 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.3, 2.2.1 Reporter: Daniel Fernandez I found that there is no JDBC dialect for Hive in Spark, so I would like to add HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it. Only two functions need to be overridden: * canHandle * quoteIdentifier
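The two overrides named in the ticket can be sketched in a few lines. Note the `jdbc:hive2` URL prefix and backtick quoting below are assumptions based on common Hive conventions, not necessarily what the linked pull request implements:

```java
public class HiveDialectSketch {
    // canHandle: claim JDBC URLs that look like HiveServer2 connections
    // (prefix is an assumption, not taken from the actual PR).
    static boolean canHandle(String url) {
        return url.startsWith("jdbc:hive2");
    }

    // quoteIdentifier: HiveQL quotes identifiers with backticks; escape
    // any embedded backtick by doubling it, then wrap the name.
    static String quoteIdentifier(String colName) {
        return "`" + colName.replace("`", "``") + "`";
    }

    public static void main(String[] args) {
        System.out.println(canHandle("jdbc:hive2://localhost:10000/default")); // prints true
        System.out.println(quoteIdentifier("order")); // prints `order`
    }
}
```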
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166945#comment-16166945 ] Michael N commented on SPARK-21999: --- Some related questions that would provide insight into this issue:
1. Based on the stack trace, what triggered Spark to serialize the application objects on the master node?
2. Why does Spark perform this type of serialization asynchronously?
3. What other conditions trigger Spark to perform this type of serialization asynchronously?
4. Can it be configured to perform this type of serialization synchronously instead?
5. Excluding application code, what list objects does Spark use that would be part of this type of serialization?
Thanks. Michael N. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetches the next batch, so it results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > Serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > The stack trace listed below occurred on the Spark master in Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize ? What is the root cause for this issue ? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks.
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at >
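The top frame of the trace, `ClosureCleaner$.ensureSerializable`, is essentially a trial Java serialization of the task closure: Spark writes the closure to a throwaway stream and fails fast if anything reachable from it cannot be written. A ConcurrentModificationException at this point typically means some collection in the closure graph was mutated while being serialized. A minimal probe of the same mechanism (hypothetical helper, not Spark's code):

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public class ClosureSerializabilityDemo {
    // Attempt a full Java serialization of an object graph, as
    // ClosureCleaner.ensureSerializable does; return false on any failure
    // (NotSerializableException, ConcurrentModificationException, ...).
    static boolean isSerializable(Object closure) {
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(closure);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable("a string"));    // prints true
        System.out.println(isSerializable(new Object()));  // prints false: not Serializable
    }
}
```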
[jira] [Resolved] (SPARK-21988) Add default stats to StreamingExecutionRelation
[ https://issues.apache.org/jira/browse/SPARK-21988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21988. -- Resolution: Fixed Assignee: Jose Torres Fix Version/s: 2.3.0 > Add default stats to StreamingExecutionRelation > --- > > Key: SPARK-21988 > URL: https://issues.apache.org/jira/browse/SPARK-21988 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Assignee: Jose Torres > Fix For: 2.3.0 > > > StreamingExecutionRelation currently doesn't implement stats. > This makes some sense, but unfortunately the LeafNode contract requires that > nodes which survive analysis implement stats, and StreamingExecutionRelation > can indeed survive analysis when running explain() on a streaming dataframe. > This value won't ever be used during execution, because > StreamingExecutionRelation does *not* survive analysis on the execution path; > it's replaced with each batch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs
[ https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21987: Assignee: (was: Apache Spark) > Spark 2.3 cannot read 2.2 event logs > > > Key: SPARK-21987 > URL: https://issues.apache.org/jira/browse/SPARK-21987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Blocker > > Reported by [~jincheng] in a comment in SPARK-18085: > {noformat} > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: > Unrecognized field "metadata" (class > org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 > known properties: "simpleString", "nodeName", "children", "metrics"]) > at [Source: > {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json > at > NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"== > Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: > array\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- > LogicalRDD [uid#327L, 
gids#328]\n\n== Physical Plan ==\nExchange > RoundRobinPartitioning(200)\n+- Scan > ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange > > RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan > > ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number > of output > rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data > size total (min, med, > max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: > 1, column: 1622] (through reference chain: > org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"]) > at > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51) > {noformat} > This was caused by SPARK-17701 (which at this moment is still open even > though the patch has been committed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs
[ https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166637#comment-16166637 ] Apache Spark commented on SPARK-21987: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19237 > Spark 2.3 cannot read 2.2 event logs > > > Key: SPARK-21987 > URL: https://issues.apache.org/jira/browse/SPARK-21987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Blocker > > Reported by [~jincheng] in a comment in SPARK-18085: > {noformat} > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: > Unrecognized field "metadata" (class > org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 > known properties: "simpleString", "nodeName", "children", "metrics"]) > at [Source: > {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json > at > NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"== > Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: > array\nRepartition 200, true\n+- LogicalRDD 
[uid#327L, > gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- > LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange > RoundRobinPartitioning(200)\n+- Scan > ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange > > RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan > > ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number > of output > rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data > size total (min, med, > max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: > 1, column: 1622] (through reference chain: > org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"]) > at > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51) > {noformat} > This was caused by SPARK-17701 (which at this moment is still open even > though the patch has been committed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
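The failure quoted above is a forward-compatibility problem: a 2.2-era reader rejects the `metadata` field that a newer writer emits. As an illustrative sketch only (in Python rather than Spark's Scala/Jackson stack), a tolerant reader can simply drop fields it does not recognize; the four known field names below are taken from the error message, everything else is hypothetical:

```python
import json

# Fields the 2.2-era SparkPlanInfo reader knows about; "metadata" (added by
# a newer writer) is not among them, which is what triggers the
# UnrecognizedPropertyException in the quoted stack trace.
KNOWN_FIELDS = {"nodeName", "simpleString", "children", "metrics"}

def tolerant_parse(plan_json: str) -> dict:
    """Parse a plan node, silently dropping unknown fields, recursively."""
    def strip(node: dict) -> dict:
        kept = {k: v for k, v in node.items() if k in KNOWN_FIELDS}
        kept["children"] = [strip(c) for c in kept.get("children", [])]
        return kept
    return strip(json.loads(plan_json))

# A newer-format node containing the field the old reader chokes on.
newer_log = ('{"nodeName": "Exchange", "simpleString": "Exchange", '
             '"children": [], "metadata": {}, "metrics": []}')
node = tolerant_parse(newer_log)
```

This mirrors the general fix direction for such errors (e.g. Jackson's `FAIL_ON_UNKNOWN_PROPERTIES` feature), not the actual patch in pull request 19237.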
[jira] [Assigned] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs
[ https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21987: Assignee: Apache Spark > Spark 2.3 cannot read 2.2 event logs > > > Key: SPARK-21987 > URL: https://issues.apache.org/jira/browse/SPARK-21987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Blocker > > Reported by [~jincheng] in a comment in SPARK-18085: > {noformat} > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: > Unrecognized field "metadata" (class > org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 > known properties: "simpleString", "nodeName", "children", "metrics"]) > at [Source: > {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json > at > NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"== > Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: > array\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- > LogicalRDD 
[uid#327L, gids#328]\n\n== Physical Plan ==\nExchange > RoundRobinPartitioning(200)\n+- Scan > ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange > > RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan > > ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number > of output > rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data > size total (min, med, > max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: > 1, column: 1622] (through reference chain: > org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"]) > at > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51) > {noformat} > This was caused by SPARK-17701 (which at this moment is still open even > though the patch has been committed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java
[ https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22015: Assignee: Apache Spark > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java > - > > Key: SPARK-22015 > URL: https://issues.apache.org/jira/browse/SPARK-22015 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Nikhil Bhide >Assignee: Apache Spark >Priority: Trivial > > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code > convention.
[jira] [Commented] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java
[ https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166561#comment-16166561 ] Apache Spark commented on SPARK-22015: -- User 'nikhilbhide' has created a pull request for this issue: https://github.com/apache/spark/pull/19236 > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java > - > > Key: SPARK-22015 > URL: https://issues.apache.org/jira/browse/SPARK-22015 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Nikhil Bhide >Priority: Trivial > > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code > convention.
[jira] [Assigned] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java
[ https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22015: Assignee: (was: Apache Spark) > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java > - > > Key: SPARK-22015 > URL: https://issues.apache.org/jira/browse/SPARK-22015 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Nikhil Bhide >Priority: Trivial > > Remove usage of non-used private field isAuthenticated from > org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code > convention.
[jira] [Created] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java
Nikhil Bhide created SPARK-22015: Summary: Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java Key: SPARK-22015 URL: https://issues.apache.org/jira/browse/SPARK-22015 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Nikhil Bhide Priority: Trivial Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code convention.
[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21997: -- Description: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on it `true`. {code} scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ {code} was: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on it `true`. 
{code} scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ scala> spark.version res3: String = 2.2.0 {code} > Spark shows different results on char/varchar columns on Parquet > > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") > scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
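The two outputs in the report differ only in trailing padding: with `convertMetastoreParquet=false` the Hive serde pads a `CHAR(10)` value to its declared length (`a` plus nine spaces), while the native Parquet reader, which is the default, returns the raw stored string. A minimal Python sketch of the two read paths as the digest's outputs suggest them; the function and flag names are illustrative, not Spark APIs:

```python
def read_char_column(raw: str, declared_len: int, hive_serde: bool) -> str:
    # CHAR(n) read semantics: the Hive serde space-pads to the declared
    # length; the native reader (before the fix) returned the value as stored.
    return raw.ljust(declared_len) if hive_serde else raw

via_hive_serde = read_char_column("a", 10, hive_serde=True)   # convertMetastoreParquet=false
via_native = read_char_column("a", 10, hive_serde=False)      # convertMetastoreParquet=true
```

Because equality comparisons and joins on CHAR columns see the padded and unpadded forms as different strings, the two configurations can return different query results from the same table.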
[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21997: -- Summary: Spark shows different results on char/varchar columns on Parquet (was: Spark shows different results on Hive char/varchar columns on Parquet) > Spark shows different results on char/varchar columns on Parquet > > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") > scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > scala> spark.version > res3: String = 2.2.0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21997: -- Description: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on it `true`. {code} scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ scala> spark.version res3: String = 2.2.0 {code} was: SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different results according to the SQL configuration, *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is wrong by default. For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on it `true`. 
{code} hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet; hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t; scala> sql("SELECT * FROM t_char").show +---+---+ | a| b| +---+---+ | a| b| +---+---+ scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("SELECT * FROM t_char").show +--+---+ | a| b| +--+---+ |a | b| +--+---+ scala> spark.version res3: String = 2.2.0 {code} > Spark shows different results on Hive char/varchar columns on Parquet > - > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet") > scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'") > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > scala> spark.version > res3: String = 2.2.0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166539#comment-16166539 ] Apache Spark commented on SPARK-21997: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19235 > Spark shows different results on Hive char/varchar columns on Parquet > - > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet; > hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t; > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > scala> spark.version > res3: String = 2.2.0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21997: Assignee: (was: Apache Spark) > Spark shows different results on Hive char/varchar columns on Parquet > - > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet; > hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t; > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > scala> spark.version > res3: String = 2.2.0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet
[ https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21997: Assignee: Apache Spark > Spark shows different results on Hive char/varchar columns on Parquet > - > > Key: SPARK-21997 > URL: https://issues.apache.org/jira/browse/SPARK-21997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows > different results according to the SQL configuration, > *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, > the default of `spark.sql.hive.convertMetastoreParquet` is true, so the > result is wrong by default. > For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so > SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn > on it `true`. > {code} > hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet; > hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t; > scala> sql("SELECT * FROM t_char").show > +---+---+ > | a| b| > +---+---+ > | a| b| > +---+---+ > scala> sql("set spark.sql.hive.convertMetastoreParquet=false") > scala> sql("SELECT * FROM t_char").show > +--+---+ > | a| b| > +--+---+ > |a | b| > +--+---+ > scala> spark.version > res3: String = 2.2.0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22010: --- Description: To convert timestamp type to python we are using {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100){code} code. {code} In [34]: %%timeit ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) ...: 4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) {code} It's slow, because: # we're trying to get TZ on every conversion # we're using replace method Proposed solution: custom datetime conversion and move calculation of TZ to module was: To convert timestamp type to python we are using {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100){code} code. {code} In [34]: %%timeit ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) ...: 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) {code} It's slow, because: # we're trying to get TZ on every conversion # we're using replace method Proposed solution: custom datetime conversion and move calculation of TZ to module > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100){code} > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.58 µs ± 558 ns per loop (mean ± std. dev. 
of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module
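(The digit runs in the quoted snippet appear truncated by the digest; the divisor in PySpark's microsecond-based internal representation is presumably `1000000`.) A sketch of the current conversion next to the proposed direction, with the local-UTC offset resolved once at module load rather than on every call; a real implementation must still handle DST transitions, which this sketch deliberately ignores:

```python
import datetime

def slow_from_internal(ts: int) -> datetime.datetime:
    # Current style of conversion: fromtimestamp() re-resolves the local
    # timezone on every call, and .replace() builds a second object.
    return datetime.datetime.fromtimestamp(ts // 1_000_000).replace(
        microsecond=ts % 1_000_000)

# Proposed direction, sketched: compute the local offset once at import time
# and do plain datetime arithmetic afterwards. Not DST-safe as written.
_EPOCH = datetime.datetime(1970, 1, 1)
_LOCAL_OFFSET = datetime.datetime.fromtimestamp(0) - _EPOCH  # computed once

def fast_from_internal(ts: int) -> datetime.datetime:
    return _EPOCH + _LOCAL_OFFSET + datetime.timedelta(microseconds=ts)
```

Both paths agree for timestamps whose local offset matches the epoch's; the win is avoiding the per-call timezone lookup and the extra `replace()` allocation.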
[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22010: --- Priority: Minor (was: Major) > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module
[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22010: --- Issue Type: Improvement (was: Bug) > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module
[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22010: --- Description: To convert timestamp type to python we are using {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100){code} code. {code} In [34]: %%timeit ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) ...: 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) {code} It's slow, because: # we're trying to get TZ on every conversion # we're using replace method Proposed solution: custom datetime conversion and move calculation of TZ to module was: To convert timestamp type to python we are using `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100)` code. {code} In [34]: %%timeit ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) ...: 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) {code} It's slow, because: # we're trying to get TZ on every conversion # we're using replace method Proposed solution: custom datetime conversion and move calculation of TZ to module > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100){code} > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. 
of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166516#comment-16166516 ] Michael N commented on SPARK-21999: --- My app was not asynchronously modifying an object that your streaming job accessed on the driver. Since we have ruled that out and given that - this issue occurred on the master - other people have submitted tickets for various occurrences of ConcurrentModificationException - the line numbers in the Spark code were captured in the stack trace could you have it tracked down and resolved ? > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetch the next batch, so it results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > Serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > The stack trace listed below occurred on the Spark master in Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize ? What is the root cause for this issue ? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166493#comment-16166493 ] Sean Owen commented on SPARK-21999: --- I'm suggesting it could happen if your app was asynchronously modifying an object that your streaming job accessed on the driver. Maybe not, but that's the first thing to rule out. It's hard to guess what it could be otherwise, unfortunately. > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. Spark aborts the > current batch and fetches the next one, which results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > serialization of accumulators in heartbeats. However, my Spark streaming app > does not use accumulators. > The stack trace listed below occurred on the Spark master in the Spark streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize? What is the root cause of this issue? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) >
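The failure mode Sean Owen is asking the reporter to rule out — a driver-side collection mutated while the closure cleaner is serializing it — can be illustrated in miniature. The sketch below is hypothetical and not Spark code: Java serialization iterates a `HashMap` and throws `ConcurrentModificationException` on mutation; Python's `dict` raises the analogous `RuntimeError` when mutated during iteration.

```python
# Hypothetical sketch: walk a dict the way a serializer would, while a
# "background" mutation lands mid-walk. Python raises RuntimeError
# ("dictionary changed size during iteration"), analogous to Java's
# fail-fast ConcurrentModificationException.
def serialize_like_walk(state):
    out = []
    for key in state:                        # iterating, as a serializer would
        out.append((key, state[key]))
        state["late-" + str(key)] = None     # mutation during the walk
    return out

try:
    serialize_like_walk({"a": 1, "b": 2})
    raised = False
except RuntimeError:
    raised = True
```

The point of the analogy: the exception surfaces inside the serializer's stack frames (here the loop; in the JIRA stack trace, `ClosureCleaner.ensureSerializable`), even though the bug is the concurrent mutation elsewhere.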
[jira] [Commented] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id
[ https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166467#comment-16166467 ] Dongjoon Hyun commented on SPARK-21858: --- Thank you for the conclusion, [~cloud_fan]! > Make Spark grouping_id() compatible with Hive grouping__id > -- > > Key: SPARK-21858 > URL: https://issues.apache.org/jira/browse/SPARK-21858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yann Byron > > If you want to migrate some ETLs using `grouping__id` in Hive to Spark and > use Spark `grouping_id()` instead of Hive `grouping__id`, you will find a > difference between their evaluations. > Here is an example. > {code:java} > select A, B, grouping__id/grouping_id() from t group by A, B grouping > sets((), (A), (B), (A,B)) > {code} > Running it on Hive and Spark separately, you'll find this: (the selected > attribute in selected grouping set is represented by (/) and otherwise by > (x)) > ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B > A|| > |(x) (x)|11|3|0|00|(x) (x)| > |(x) (/)|10|2|2|10|(/) (x)| > |(/) (x)|01|1|1|01|(x) (/)| > |(/) (/)|00|0|3|11|(/) (/)| > As shown above, in Hive (/) is set to 0 and (x) to 1, and in Spark it is the > opposite. > Moreover, attributes in `group by` are reversed first in Hive, while in Spark > they are evaluated directly. > In my opinion, the behavior of `grouping_id()` should be modified to make > it compatible with Hive `grouping__id`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
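The two encodings in the reporter's table can be reproduced with a short illustrative snippet. The helper names are hypothetical (this is not Spark or Hive code), and `hive_grouping__id` models the legacy Hive semantics the table describes:

```python
def spark_grouping_id(cols, selected):
    # Spark: the leftmost grouping column is the highest bit, and a bit is 1
    # when the column is aggregated away, i.e. NOT in the grouping set
    gid = 0
    for col in cols:
        gid = (gid << 1) | (0 if col in selected else 1)
    return gid

def hive_grouping__id(cols, selected):
    # Legacy Hive: bits run from the first column (lowest bit) upward, and a
    # bit is 1 when the column IS in the grouping set
    return sum(1 << i for i, col in enumerate(cols) if col in selected)

cols = ["A", "B"]
# the four grouping sets, in the order of the table's rows:
# (x)(x) -> (), (x)(/) -> (B), (/)(x) -> (A), (/)(/)  -> (A, B)
rows = [set(), {"B"}, {"A"}, {"A", "B"}]
spark_ids = [spark_grouping_id(cols, s) for s in rows]   # [3, 2, 1, 0]
hive_ids = [hive_grouping__id(cols, s) for s in rows]    # [0, 2, 1, 3]
```

Both lists match the `Spark` and `Hive` columns of the table, making the inversion and the reversed column order explicit.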
[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22014: -- Target Version/s: (was: 2.2.0) > Sample windows in Spark SQL > --- > > Key: SPARK-22014 > URL: https://issues.apache.org/jira/browse/SPARK-22014 > Project: Spark > Issue Type: Wish > Components: DStreams, SQL >Affects Versions: 2.2.0 >Reporter: Simon Schiff >Priority: Minor > > Hello, > I am using spark to process measurement data. It is possible to create sample > windows in Spark Streaming, where the duration of the window is smaller than > the slide. But when I try to do the same with Spark SQL (The measurement data > has a time stamp column) then I got an analysis exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type > mismatch: The slide duration (18000) must be less than or equal to the > windowDuration (6000) > {code} > Here is a example: > {code:java} > import java.sql.Timestamp; > import java.text.SimpleDateFormat; > import java.util.ArrayList; > import java.util.Date; > import java.util.List; > import org.apache.spark.api.java.function.Function; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Encoders; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.functions; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > public class App { > public static Timestamp createTimestamp(String in) throws Exception { > SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd > hh:mm:ss"); > Date parsedDate = dateFormat.parse(in); > return new Timestamp(parsedDate.getTime()); > } > > public static void main(String[] args) { > SparkSession spark = SparkSession.builder().appName("Window > Sampling 
Example").getOrCreate(); > > List sensorData = new ArrayList(); > sensorData.add("2017-08-04 00:00:00, 22.75"); > sensorData.add("2017-08-04 00:01:00, 23.82"); > sensorData.add("2017-08-04 00:02:00, 24.15"); > sensorData.add("2017-08-04 00:03:00, 23.16"); > sensorData.add("2017-08-04 00:04:00, 22.62"); > sensorData.add("2017-08-04 00:05:00, 22.89"); > sensorData.add("2017-08-04 00:06:00, 23.21"); > sensorData.add("2017-08-04 00:07:00, 24.59"); > sensorData.add("2017-08-04 00:08:00, 24.44"); > > Dataset in = spark.createDataset(sensorData, > Encoders.STRING()); > > StructType sensorSchema = DataTypes.createStructType(new > StructField[] { > DataTypes.createStructField("timestamp", > DataTypes.TimestampType, false), > DataTypes.createStructField("value", > DataTypes.DoubleType, false), > }); > > Dataset data = > spark.createDataFrame(in.toJavaRDD().map(new Function() { > public Row call(String line) throws Exception { > return > RowFactory.create(createTimestamp(line.split(",")[0]), > Double.parseDouble(line.split(",")[1])); > } > }), sensorSchema); > > data.groupBy(functions.window(data.col("timestamp"), "1 > minutes", "3 minutes")).avg("value").orderBy("window").show(false); > } > } > {code} > I think there should be no difference (duration and slide) in a "Spark > Streaming window" and a "Spark SQL window" function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
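Until `window()` accepts a slide larger than the duration, the sampling the reporter asks for can be emulated by bucketing timestamps on the slide and keeping only rows within the first `duration` of each bucket. A minimal sketch of the arithmetic in plain Python (hypothetical helper, not a Spark API):

```python
# Emulate a "sample window" whose duration (1 minute) is shorter than its
# slide (3 minutes): keep a row iff it falls within the first `duration_sec`
# seconds of its slide bucket.
def in_sample_window(ts_sec, duration_sec, slide_sec):
    return ts_sec % slide_sec < duration_sec

# timestamps at one-minute granularity, as in the reporter's sensor data
rows = [minute * 60 for minute in range(9)]
kept = [t for t in rows if in_sample_window(t, duration_sec=60, slide_sec=180)]
# kept == [0, 180, 360]: one 1-minute sample taken every 3 minutes
```

In Spark SQL the same predicate could be expressed as a filter on the timestamp column before aggregating, which sidesteps the `AnalysisException` while the restriction stands.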
[jira] [Resolved] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
[ https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22011. --- Resolution: Not A Problem The error pretty much says it all -- set SPARK_HOME. See the SparkR docs. > model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error > -- > > Key: SPARK-22011 > URL: https://issues.apache.org/jira/browse/SPARK-22011 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.2.0 > Environment: Error showing on SparkR >Reporter: Atul Khairnar > > Sys.setenv(SPARK_HOME="C:/spark") > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/") > library(SparkR) > sc <- sparkR.session(master = "local") > sqlContext <- sparkRSQL.init(sc) > o/p: showing error in Rstudio > Warning message: > 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Can you help me what exactly error/warning...and next > model <- spark.logit(training, Survived ~ ., regParam = 0.5) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: > SPARK_HOME not set. Can't locate SparkR package. 
> at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87) > at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339) > at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391) > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
[jira] [Assigned] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22010: Assignee: (was: Apache Spark) > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22010: Assignee: Apache Spark > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166430#comment-16166430 ] Apache Spark commented on SPARK-22010: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/19234 > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
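The conversion under discussion can be sketched in plain Python. Assumptions to flag: the divisor shown as `100` in the quoted report is presumably `1000000` (microseconds per second, with zeros lost in the mail archive), and `from_internal_fast` below only illustrates the proposed direction (do the timezone work once per module, not per value); it is not the actual pull-request code.

```python
import datetime
import os
import time

os.environ["TZ"] = "UTC"  # pin the local zone so the comparison is deterministic
time.tzset()              # POSIX only

MICROS_PER_SEC = 1_000_000  # assumed value of the garbled constant in the report

def from_internal_slow(ts):
    # the reported pattern: fromtimestamp() consults the local TZ on every
    # call, then replace() builds a second datetime object
    return datetime.datetime.fromtimestamp(ts // MICROS_PER_SEC).replace(
        microsecond=ts % MICROS_PER_SEC)

_EPOCH = datetime.datetime.fromtimestamp(0)  # TZ work done once, at module load

def from_internal_fast(ts):
    # the proposed direction: precomputed epoch plus pure timedelta arithmetic
    return _EPOCH + datetime.timedelta(microseconds=ts)

ts = 1505383647_012344  # internal value: microseconds since the epoch
assert from_internal_slow(ts) == from_internal_fast(ts)
```

With a fixed-offset zone the two agree exactly; the fast path avoids both the per-call timezone lookup and the extra object from `replace()`, which is where the profiled time goes.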
[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Schiff updated SPARK-22014: - Description: Hello, I am using spark to process measurement data. It is possible to create sample windows in Spark Streaming, where the duration of the window is smaller than the slide. But when I try to do the same with Spark SQL (The measurement data has a time stamp column) then I got an analysis exception: {code} Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type mismatch: The slide duration (18000) must be less than or equal to the windowDuration (6000) {code} Here is a example: {code:java} import java.sql.Timestamp; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Date; import java.util.List; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Encoders; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; public class App { public static Timestamp createTimestamp(String in) throws Exception { SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd hh:mm:ss"); Date parsedDate = dateFormat.parse(in); return new Timestamp(parsedDate.getTime()); } public static void main(String[] args) { SparkSession spark = SparkSession.builder().appName("Window Sampling Example").getOrCreate(); List sensorData = new ArrayList(); sensorData.add("2017-08-04 00:00:00, 22.75"); sensorData.add("2017-08-04 00:01:00, 23.82"); sensorData.add("2017-08-04 00:02:00, 24.15"); sensorData.add("2017-08-04 00:03:00, 23.16"); sensorData.add("2017-08-04 00:04:00, 22.62"); sensorData.add("2017-08-04 00:05:00, 22.89"); 
sensorData.add("2017-08-04 00:06:00, 23.21"); sensorData.add("2017-08-04 00:07:00, 24.59"); sensorData.add("2017-08-04 00:08:00, 24.44"); Dataset in = spark.createDataset(sensorData, Encoders.STRING()); StructType sensorSchema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("timestamp", DataTypes.TimestampType, false), DataTypes.createStructField("value", DataTypes.DoubleType, false), }); Dataset data = spark.createDataFrame(in.toJavaRDD().map(new Function() { public Row call(String line) throws Exception { return RowFactory.create(createTimestamp(line.split(",")[0]), Double.parseDouble(line.split(",")[1])); } }), sensorSchema); data.groupBy(functions.window(data.col("timestamp"), "1 minutes", "3 minutes")).avg("value").orderBy("window").show(false); } } {code} I think there should be no difference (duration and slide) in a "Spark Streaming window" and a "Spark SQL window" function. was: Hello, I am using spark to process measurement data. It is possible to create sample windows in Spark Streaming, where the duration of the window is smaller than the slide. 
But when I try to do the same with Spark SQL (The measurement data has a time stamp column) then i got an analysis exception: {code} Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type mismatch: The slide duration (18000) must be less than or equal to the windowDuration (6000) {code} Here is a example: {code:java} import java.sql.Timestamp; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Date; import java.util.List; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Encoders; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; public class App { public static Timestamp createTimestamp(String in) throws Exception { SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd hh:mm:ss"); Date parsedDate =
[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Schiff updated SPARK-22014: - Description: Hello, I am using spark to process measurement data. It is possible to create sample windows in Spark Streaming, where the duration of the window is smaller than the slide. But when I try to do the same with Spark SQL (The measurement data has a time stamp column) then i got an analysis exception: {code} Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type mismatch: The slide duration (18000) must be less than or equal to the windowDuration (6000) {code} Here is a example: {code:java} import java.sql.Timestamp; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Date; import java.util.List; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Encoders; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; public class App { public static Timestamp createTimestamp(String in) throws Exception { SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd hh:mm:ss"); Date parsedDate = dateFormat.parse(in); return new Timestamp(parsedDate.getTime()); } public static void main(String[] args) { SparkSession spark = SparkSession.builder().appName("Window Sampling Example").getOrCreate(); List sensorData = new ArrayList(); sensorData.add("2017-08-04 00:00:00, 22.75"); sensorData.add("2017-08-04 00:01:00, 23.82"); sensorData.add("2017-08-04 00:02:00, 24.15"); sensorData.add("2017-08-04 00:03:00, 23.16"); sensorData.add("2017-08-04 00:04:00, 22.62"); sensorData.add("2017-08-04 00:05:00, 22.89"); 
sensorData.add("2017-08-04 00:06:00, 23.21"); sensorData.add("2017-08-04 00:07:00, 24.59"); sensorData.add("2017-08-04 00:08:00, 24.44"); Dataset in = spark.createDataset(sensorData, Encoders.STRING()); StructType sensorSchema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("timestamp", DataTypes.TimestampType, false), DataTypes.createStructField("value", DataTypes.DoubleType, false), }); Dataset data = spark.createDataFrame(in.toJavaRDD().map(new Function() { public Row call(String line) throws Exception { return RowFactory.create(createTimestamp(line.split(",")[0]), Double.parseDouble(line.split(",")[1])); } }), sensorSchema); data.groupBy(functions.window(data.col("timestamp"), "1 minutes", "3 minutes")).avg("value").orderBy("window").show(false); } } {code} I think there should be no difference (duration and slide) in a "Spark Streaming window" and a "Spark SQL window" function. was: Hello, i am using spark to process measurement data. It is possible to create sample windows in Spark Streaming, where the duration of the window is smaller than the slide. 
But when I try to do the same with Spark SQL (The measurement data has a time stamp column) then i got a analysis exception: {code} Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type mismatch: The slide duration (18000) must be less than or equal to the windowDuration (6000) {code} Here is a example: {code:java} import java.sql.Timestamp; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Date; import java.util.List; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Encoders; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; public class App { public static Timestamp createTimestamp(String in) throws Exception { SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd hh:mm:ss"); Date parsedDate =
[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Schiff updated SPARK-22014: - Issue Type: Wish (was: Improvement) > Sample windows in Spark SQL > --- > > Key: SPARK-22014 > URL: https://issues.apache.org/jira/browse/SPARK-22014 > Project: Spark > Issue Type: Wish > Components: DStreams, SQL >Affects Versions: 2.2.0 >Reporter: Simon Schiff >Priority: Minor > > Hello, > i am using spark to process measurement data. It is possible to create sample > windows in Spark Streaming, where the duration of the window is smaller than > the slide. But when I try to do the same with Spark SQL (The measurement data > has a time stamp column) then i got a analysis exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type > mismatch: The slide duration (18000) must be less than or equal to the > windowDuration (6000) > {code} > Here is a example: > {code:java} > import java.sql.Timestamp; > import java.text.SimpleDateFormat; > import java.util.ArrayList; > import java.util.Date; > import java.util.List; > import org.apache.spark.api.java.function.Function; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Encoders; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.functions; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > public class App { > public static Timestamp createTimestamp(String in) throws Exception { > SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd > hh:mm:ss"); > Date parsedDate = dateFormat.parse(in); > return new Timestamp(parsedDate.getTime()); > } > > public static void main(String[] args) { > SparkSession spark = SparkSession.builder().appName("Window > 
Sampling Example").getOrCreate(); > > List sensorData = new ArrayList(); > sensorData.add("2017-08-04 00:00:00, 22.75"); > sensorData.add("2017-08-04 00:01:00, 23.82"); > sensorData.add("2017-08-04 00:02:00, 24.15"); > sensorData.add("2017-08-04 00:03:00, 23.16"); > sensorData.add("2017-08-04 00:04:00, 22.62"); > sensorData.add("2017-08-04 00:05:00, 22.89"); > sensorData.add("2017-08-04 00:06:00, 23.21"); > sensorData.add("2017-08-04 00:07:00, 24.59"); > sensorData.add("2017-08-04 00:08:00, 24.44"); > > Dataset in = spark.createDataset(sensorData, > Encoders.STRING()); > > StructType sensorSchema = DataTypes.createStructType(new > StructField[] { > DataTypes.createStructField("timestamp", > DataTypes.TimestampType, false), > DataTypes.createStructField("value", > DataTypes.DoubleType, false), > }); > > Dataset data = > spark.createDataFrame(in.toJavaRDD().map(new Function() { > public Row call(String line) throws Exception { > return > RowFactory.create(createTimestamp(line.split(",")[0]), > Double.parseDouble(line.split(",")[1])); > } > }), sensorSchema); > > data.groupBy(functions.window(data.col("timestamp"), "1 > minutes", "3 minutes")).avg("value").orderBy("window").show(false); > } > } > {code} > I think there should be no difference (duration and slide) in a "Spark > Streaming window" and a "Spark SQL window" function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
[ https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22011: - Fix Version/s: (was: 0.8.2) > model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error > -- > > Key: SPARK-22011 > URL: https://issues.apache.org/jira/browse/SPARK-22011 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.2.0 > Environment: Error showing on SparkR >Reporter: Atul Khairnar > > Sys.setenv(SPARK_HOME="C:/spark") > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/") > library(SparkR) > sc <- sparkR.session(master = "local") > sqlContext <- sparkRSQL.init(sc) > o/p: showing error in Rstudio > Warning message: > 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Can you help me what exactly error/warning...and next > model <- spark.logit(training, Survived ~ ., regParam = 0.5) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: > SPARK_HOME not set. Can't locate SparkR package. 
> at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87) > at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339) > at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391) > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
[jira] [Updated] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
[ https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22011: - Target Version/s: (was: 2.2.0) > model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error > -- > > Key: SPARK-22011 > URL: https://issues.apache.org/jira/browse/SPARK-22011 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.2.0 > Environment: Error showing on SparkR >Reporter: Atul Khairnar > > Sys.setenv(SPARK_HOME="C:/spark") > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/") > library(SparkR) > sc <- sparkR.session(master = "local") > sqlContext <- sparkRSQL.init(sc) > o/p: showing error in Rstudio > Warning message: > 'sparkRSQL.init' is deprecated. > Use 'sparkR.session' instead. > See help("Deprecated") > Can you help me what exactly error/warning...and next > model <- spark.logit(training, Survived ~ ., regParam = 0.5) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: > SPARK_HOME not set. Can't locate SparkR package. 
> at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87) > at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339) > at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391) > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22014) Sample windows in Spark SQL
Simon Schiff created SPARK-22014: Summary: Sample windows in Spark SQL Key: SPARK-22014 URL: https://issues.apache.org/jira/browse/SPARK-22014 Project: Spark Issue Type: Improvement Components: DStreams, SQL Affects Versions: 2.2.0 Reporter: Simon Schiff Priority: Minor

Hello, I am using Spark to process measurement data. It is possible to create sample windows in Spark Streaming, where the duration of the window is smaller than the slide. But when I try to do the same with Spark SQL (the measurement data has a timestamp column), I get an AnalysisException:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type mismatch: The slide duration (180000000) must be less than or equal to the windowDuration (60000000)
{code}

Here is an example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
	public static Timestamp createTimestamp(String in) throws Exception {
		SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
		Date parsedDate = dateFormat.parse(in);
		return new Timestamp(parsedDate.getTime());
	}

	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder().appName("Window Sampling Example").getOrCreate();
		List<String> sensorData = new ArrayList<String>();
		sensorData.add("2017-08-04 00:00:00, 22.75");
		sensorData.add("2017-08-04 00:01:00, 23.82");
		sensorData.add("2017-08-04 00:02:00, 24.15");
		sensorData.add("2017-08-04 00:03:00, 23.16");
		sensorData.add("2017-08-04 00:04:00, 22.62");
		sensorData.add("2017-08-04 00:05:00, 22.89");
		sensorData.add("2017-08-04 00:06:00, 23.21");
		sensorData.add("2017-08-04 00:07:00, 24.59");
		sensorData.add("2017-08-04 00:08:00, 24.44");
		Dataset<String> in = spark.createDataset(sensorData, Encoders.STRING());
		StructType sensorSchema = DataTypes.createStructType(new StructField[] {
			DataTypes.createStructField("timestamp", DataTypes.TimestampType, false),
			DataTypes.createStructField("value", DataTypes.DoubleType, false),
		});
		Dataset<Row> data = spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
			public Row call(String line) throws Exception {
				return RowFactory.create(createTimestamp(line.split(",")[0]), Double.parseDouble(line.split(",")[1]));
			}
		}), sensorSchema);
		data.groupBy(functions.window(data.col("timestamp"), "1 minutes", "3 minutes")).avg("value").orderBy("window").show(false);
	}
}
{code}

I think there should be no difference (duration and slide) between a "Spark Streaming window" and a "Spark SQL window" function.
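The behaviour being requested (a window whose slide is larger than its duration) amounts to sampling: rows that fall in the gap between two windows are simply dropped. A minimal pure-Python sketch of that assignment rule (illustrative only, not Spark's implementation; times are plain integers in seconds):

```python
def sample_window(ts, duration, slide):
    """Assign a timestamp to the sampling window of length `duration`
    that recurs every `slide` time units (slide > duration), or return
    None when the timestamp falls in the gap between two windows."""
    start = (ts // slide) * slide      # most recent window start
    if ts - start < duration:          # inside the sampled span?
        return (start, start + duration)
    return None

# With a 1-minute window every 3 minutes, only the first minute
# of each 3-minute period is kept:
assert sample_window(30, 60, 180) == (0, 60)
assert sample_window(90, 60, 180) is None       # dropped (gap)
assert sample_window(190, 60, 180) == (180, 240)
```

This is the semantics Spark Streaming already allows and that the ticket asks Spark SQL's `window` function to accept instead of raising an AnalysisException.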
[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166316#comment-16166316 ] Michael N commented on SPARK-21999: --- Sean, this issue occurred on the master as well. I updated the ticket to show only its stack trace to clarify that. Which closure were you referring to regarding my app modifying some list? My app is not using accumulators or broadcast variables. Thanks. Michael > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetches the next batch, which results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > serialization of accumulators in heartbeats. However, my Spark Streaming app > does not use accumulators. > The stack trace listed below occurred on the Spark master in the Spark Streaming > driver at the time of data loss. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize? What is the root cause of this issue? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at
[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael N updated SPARK-21999: -- Description: Hi, I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting ConcurrentModificationException intermittently. When it occurs, Spark does not honor the specified value of spark.task.maxFailures. So Spark aborts the current batch and fetches the next batch, which results in lost data. Its exception stack is listed below. This instance of ConcurrentModificationException is similar to the issue at https://issues.apache.org/jira/browse/SPARK-17463, which was about serialization of accumulators in heartbeats. However, my Spark Streaming app does not use accumulators. The stack trace listed below occurred on the Spark master in the Spark Streaming driver at the time of data loss. From the line of code in the first stack trace, can you tell which object Spark was trying to serialize? What is the root cause of this issue? Because this issue results in lost data as described above, could you have this issue fixed ASAP? Thanks. 
Michael N., Stack trace of Spark Streaming driver ERROR JobScheduler:91: Error generating jobs for time 150522493 ms org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) at org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) at org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) at scala.Option.map(Option.scala:146) at org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at
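A ConcurrentModificationException generally means a collection was structurally modified while something else, here apparently closure serialization, was iterating over it. Spark is JVM code, but the failure mode can be sketched outside the JVM; in Python the analogous error is a RuntimeError (a hypothetical illustration only, not Spark code):

```python
def snapshot_unsafely(metrics):
    # Walking a live dict while it is being mutated mirrors
    # serializing a live Java collection mid-update.
    out = {}
    for key in metrics:           # iteration in progress...
        metrics["extra"] = 0      # ...a concurrent-style mutation
        out[key] = metrics[key]
    return out

def snapshot_safely(metrics):
    # The usual remedy: take an atomic copy first, then walk
    # (or serialize) the snapshot instead of the live object.
    return dict(metrics)

live = {"records": 10, "bytes": 2048}
try:
    snapshot_unsafely(dict(live))
    raise AssertionError("expected a concurrent-modification failure")
except RuntimeError:
    pass  # Python's analogue of Java's ConcurrentModificationException

assert snapshot_safely(live) == live
```

This is also roughly the shape of the fix applied for the heartbeat-accumulator instance referenced above: serialize a copy rather than a structure another thread may still be updating.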
[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael N updated SPARK-21999: -- Summary: ConcurrentModificationException - Spark Streaming (was: ConcurrentModificationException - Master ) > ConcurrentModificationException - Spark Streaming > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetch the next batch, so it results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > Serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > There are two stack traces below. The first one is from the Spark streaming > driver at the time of data loss. The second one is from the Spark slave > server that runs the application several hours later. It may or may not be > related. They are listed here because they involve the same type of > ConcurrentModificationException, so they may be permutation of the same issue > and occurred at different times. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize ? What is the root cause for this issue ? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at
[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Master
[ https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael N updated SPARK-21999: -- Summary: ConcurrentModificationException - Master (was: ConcurrentModificationException - Error sending message [message = Heartbeat) > ConcurrentModificationException - Master > - > > Key: SPARK-21999 > URL: https://issues.apache.org/jira/browse/SPARK-21999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael N >Priority: Critical > > Hi, > I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting > ConcurrentModificationException intermittently. When it occurs, Spark does > not honor the specified value of spark.task.maxFailures. So Spark aborts the > current batch and fetch the next batch, so it results in lost data. Its > exception stack is listed below. > This instance of ConcurrentModificationException is similar to the issue at > https://issues.apache.org/jira/browse/SPARK-17463, which was about > Serialization of accumulators in heartbeats. However, my Spark stream app > does not use accumulators. > There are two stack traces below. The first one is from the Spark streaming > driver at the time of data loss. The second one is from the Spark slave > server that runs the application several hours later. It may or may not be > related. They are listed here because they involve the same type of > ConcurrentModificationException, so they may be permutation of the same issue > and occurred at different times. > From the line of code in the first stack trace, can you tell which object > Spark was trying to serialize ? What is the root cause for this issue ? > Because this issue results in lost data as described above, could you have > this issue fixed ASAP ? > Thanks. 
> Michael N., > > Stack trace of Spark Streaming driver > ERROR JobScheduler:91: Error generating jobs for time 150522493 ms > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > 
at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) >
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166206#comment-16166206 ] Maciej Bryński commented on SPARK-22010: I'll open a PR. > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert the timestamp type to Python we are using the > `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % > 1000000)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get the TZ on every conversion > # we're using the replace method > Proposed solution: custom datetime conversion and moving the calculation of the TZ to the > module level
[jira] [Resolved] (SPARK-21922) When an executor fails and task metrics have not been sent to the driver, the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
[ https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-21922. - Resolution: Fixed Assignee: zhoukang Target Version/s: 2.3.0 > When an executor fails and task metrics have not been sent to the driver, the status will > always be 'RUNNING' and the duration will be 'CurrentTime - launchTime' > - > > Key: SPARK-21922 > URL: https://issues.apache.org/jira/browse/SPARK-21922 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: zhoukang >Assignee: zhoukang > Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png > > > As the title describes; below is an example: > !notfixed01.png|Before fixed! > !notfixed02.png|Before fixed! > We can fix the duration time by using the modification time of the event log: > !fixed01.png|After fixed! > !fixed02.png|After fixed!
[jira] [Updated] (SPARK-21922) When an executor fails and task metrics have not been sent to the driver, the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
[ https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-21922: Fix Version/s: 2.3.0 > When an executor fails and task metrics have not been sent to the driver, the status will > always be 'RUNNING' and the duration will be 'CurrentTime - launchTime' > - > > Key: SPARK-21922 > URL: https://issues.apache.org/jira/browse/SPARK-21922 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: zhoukang >Assignee: zhoukang > Fix For: 2.3.0 > > Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png > > > As the title describes; below is an example: > !notfixed01.png|Before fixed! > !notfixed02.png|Before fixed! > We can fix the duration time by using the modification time of the event log: > !fixed01.png|After fixed! > !fixed02.png|After fixed!
[jira] [Updated] (SPARK-21922) When an executor fails and task metrics have not been sent to the driver, the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
[ https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-21922: Target Version/s: (was: 2.3.0) > When an executor fails and task metrics have not been sent to the driver, the status will > always be 'RUNNING' and the duration will be 'CurrentTime - launchTime' > - > > Key: SPARK-21922 > URL: https://issues.apache.org/jira/browse/SPARK-21922 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: zhoukang >Assignee: zhoukang > Fix For: 2.3.0 > > Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png > > > As the title describes; below is an example: > !notfixed01.png|Before fixed! > !notfixed02.png|Before fixed! > We can fix the duration time by using the modification time of the event log: > !fixed01.png|After fixed! > !fixed02.png|After fixed!
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166187#comment-16166187 ] Hyukjin Kwon commented on SPARK-22010: -- I'd also put some details in the PR description, for example, fromtimestamp implementation in Python - https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458. I still think it is trivial but sounds valid improvement. > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166181#comment-16166181 ] Hyukjin Kwon commented on SPARK-22010: -- Sounds good. Would you like to go ahead and open a PR with some small perf tests? > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166136#comment-16166136 ] Maciej Bryński edited comment on SPARK-22010 at 9/14/17 12:05 PM: -- The reason of this Jira is this profiling (attachment). !profile_fact_dok.png|thumbnail! As you can see about 80% of pyspark time is spent in Spark internals. was (Author: maver1ck): The reason of this Jira is this profiling (attachment). !profile_fact_dok.jpg|thumbnail! As you can see about 80% of pyspark time is spent in Spark internals. > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166134#comment-16166134 ] Maciej Bryński commented on SPARK-22010: Proposition: create a constant {code} utc = datetime.datetime.now(tzlocal()).tzname() == 'UTC' {code} Then change this code ( https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L196 ) to: {code} y, m, d, hh, mm, ss, _, _, _ = gmtime(ts // 1000000) if utc else localtime(ts // 1000000) datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000) {code} This runs 30% faster if TZ != UTC and 3x faster if TZ == UTC. What do you think about such a solution? > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > To convert timestamp type to python we are using > `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)` > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using the replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module
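The proposal above can be sketched in plain Python. This is only a sketch of the idea, not pyspark's actual implementation: the `IS_UTC` module constant and the function names are illustrative stand-ins for the proposed module-level TZ check (the comment uses `tzlocal()` from the dateutil package; a stdlib-only check is substituted here).

```python
import time
import datetime

# Proposed: decide once at import time whether the local zone is UTC,
# instead of paying for a TZ lookup on every conversion.
# (Assumption: strftime("%Z") stands in for the tzlocal() check.)
IS_UTC = time.strftime("%Z") in ("UTC", "GMT")

def from_internal_old(ts):
    # The conversion the report quotes from pyspark/sql/types.py:
    # fromtimestamp() looks up the TZ each call, and replace() builds
    # a second datetime object just to set the microseconds.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def from_internal_new(ts):
    # Proposed: unpack gmtime/localtime once and construct the datetime
    # directly with the microsecond component -- no replace() call.
    tm = time.gmtime(ts // 1000000) if IS_UTC else time.localtime(ts // 1000000)
    return datetime.datetime(*tm[:6], ts % 1000000)
```

Both paths should yield the same local-time datetime; the claimed speedup can be checked with `timeit` on `from_internal_old` vs `from_internal_new`.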
[jira] [Closed] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-22012. - > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after > polling for 512" > -- > > Key: SPARK-22012 > URL: https://issues.apache.org/jira/browse/SPARK-22012 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8 >Reporter: Karan Singh >Priority: Blocker > > My Spark Streaming duration is 5 seconds (5000) and kafka is all at its > default properties , i am still facing this issue , Can anyone tell me how to > resolve it what i am doing wrong ? > JavaStreamingContext ssc = new JavaStreamingContext(sc, > new Duration(5000)); > Exception in Spark Streamings > Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-xxx pulse 1 163684030 after polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at > scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22012. --- Resolution: Invalid > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after > polling for 512" > -- > > Key: SPARK-22012 > URL: https://issues.apache.org/jira/browse/SPARK-22012 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8 >Reporter: Karan Singh >Priority: Blocker > > My Spark Streaming duration is 5 seconds (5000) and kafka is all at its > default properties , i am still facing this issue , Can anyone tell me how to > resolve it what i am doing wrong ? > JavaStreamingContext ssc = new JavaStreamingContext(sc, > new Duration(5000)); > Exception in Spark Streamings > Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-xxx pulse 1 163684030 after polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at > scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan Singh updated SPARK-22012: Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8 (was: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11) > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after > polling for 512" > -- > > Key: SPARK-22012 > URL: https://issues.apache.org/jira/browse/SPARK-22012 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8 >Reporter: Karan Singh >Priority: Blocker > > My Spark Streaming duration is 5 seconds (5000) and kafka is all at its > default properties , i am still facing this issue , Can anyone tell me how to > resolve it what i am doing wrong ? > JavaStreamingContext ssc = new JavaStreamingContext(sc, > new Duration(5000)); > Exception in Spark Streamings > Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-xxx pulse 1 163684030 after polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at > scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166117#comment-16166117 ] Michel Lemay commented on SPARK-12216: -- In my case, what prevents the temp folder from being deleted is a lock on the jar I used in spark-submit. That jar is copied into the `spark-guid\userFiles-guid\` folder and loaded in the process. I'm not sure if it is caused by the JVM itself or by some handle left open (MutableURLClassLoader does not seem to be closed.) > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.5.2 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message 
was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan Singh updated SPARK-22012: Description: My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was: We have a Spark Streaming application reading records from Kafka 0.10. Some tasks are failed because of the following error: "java.lang.AssertionError: assertion failed: Failed to get records for (...) after polling for 512" The first attempt fails and the second attempt (retry) completes successfully, - this is the pattern that we see for many tasks in our logs. These fails and retries consume resources. A similar case with a stack trace are described here: https://www.mail-archive.com/user@spark.apache.org/msg56564.html https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 Here is the line from the stack trace where the error is raised: org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, 10, 30 and 60 seconds, but the error appeared in all the cases except the last one. Moreover, increasing the threshold led to increasing total Spark stage duration. In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to fewer task failures but with cost of total stage duration. So, it is bad for performance when processing data streams. We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other related classes) which inhibits the reading process. > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... 
after > polling for 512" > -- > > Key: SPARK-22012 > URL: https://issues.apache.org/jira/browse/SPARK-22012 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 >Reporter: Karan Singh > > My Spark Streaming duration is 5 seconds (5000) and kafka is all at its > default properties , i am still facing this issue , Can anyone tell me how to > resolve it what i am doing wrong ? > JavaStreamingContext ssc = new JavaStreamingContext(sc, > new Duration(5000)); > Exception in Spark Streamings > Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-xxx pulse 1 163684030 after polling for 512 > at
[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan Singh updated SPARK-22012: Priority: Blocker (was: Major) > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after > polling for 512" > -- > > Key: SPARK-22012 > URL: https://issues.apache.org/jira/browse/SPARK-22012 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 >Reporter: Karan Singh >Priority: Blocker > > My Spark Streaming duration is 5 seconds (5000) and kafka is all at its > default properties , i am still facing this issue , Can anyone tell me how to > resolve it what i am doing wrong ? > JavaStreamingContext ssc = new JavaStreamingContext(sc, > new Duration(5000)); > Exception in Spark Streamings > Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-xxx pulse 1 163684030 after polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at > scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) > at > com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) > at > 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
Karan Singh created SPARK-22012: --- Summary: CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512" Key: SPARK-22012 URL: https://issues.apache.org/jira/browse/SPARK-22012 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.0 Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 Reporter: Karan Singh We have a Spark Streaming application reading records from Kafka 0.10. Some tasks fail because of the following error: "java.lang.AssertionError: assertion failed: Failed to get records for (...) after polling for 512" The first attempt fails and the second attempt (retry) completes successfully; this is the pattern that we see for many tasks in our logs. These failures and retries consume resources. A similar case with a stack trace is described here: https://www.mail-archive.com/user@spark.apache.org/msg56564.html https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 Here is the line from the stack trace where the error is raised: org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) We tried several values for "spark.streaming.kafka.consumer.poll.ms": 2, 5, 10, 30 and 60 seconds, but the error appeared in all cases except the last one. Moreover, increasing the threshold led to increasing total Spark stage duration. In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to fewer task failures, but at the cost of longer total stage duration. So, it is bad for performance when processing data streams. We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other related classes) which inhibits the reading process.
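The poll timeout discussed above is an ordinary Spark conf key, typically raised at submit time. The values below are illustrative (60000 ms is the one setting the reporter found avoided the assertion, at the cost of longer stages), and the application class and jar names are placeholders:

```shell
# Raise the Kafka consumer poll timeout used by the executors' cached
# consumers (default in this era of Spark was 512 ms, hence the
# "after polling for 512" assertion message).
# NOTE: class and jar names below are hypothetical placeholders.
spark-submit \
  --conf spark.streaming.kafka.consumer.poll.ms=60000 \
  --class com.example.StreamingApp \
  streaming-app.jar
```

As the report notes, this is a trade-off, not a fix: a larger timeout reduces task failures but stretches stage durations when the consumer genuinely stalls.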
[jira] [Commented] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166112#comment-16166112 ] Karan Singh commented on SPARK-19275: - Hi Team , My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > Spark Streaming, Kafka receiver, "Failed to get records for ... after polling > for 512" > -- > > Key: SPARK-19275 > URL: https://issues.apache.org/jira/browse/SPARK-19275 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 >Reporter: Dmitry Ochnev > > We have a Spark Streaming application reading records from Kafka 0.10. > Some tasks are failed because of the following error: > "java.lang.AssertionError: assertion failed: Failed to get records for (...) > after polling for 512" > The first attempt fails and the second attempt (retry) completes > successfully, - this is the pattern that we see for many tasks in our logs. > These fails and retries consume resources. > A similar case with a stack trace are described here: > https://www.mail-archive.com/user@spark.apache.org/msg56564.html > https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 > Here is the line from the stack trace where the error is raised: > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, > 10, 30 and 60 seconds, but the error appeared in all the cases except the > last one. Moreover, increasing the threshold led to increasing total Spark > stage duration. > In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to > fewer task failures but with cost of total stage duration. So, it is bad for > performance when processing data streams. > We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other > related classes) which inhibits the reading process. 
[jira] [Created] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) showing error
Atul Khairnar created SPARK-22011: - Summary: model <- spark.logit(training, Survived ~ ., regParam = 0.5) showing error Key: SPARK-22011 URL: https://issues.apache.org/jira/browse/SPARK-22011 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.2.0 Environment: Error showing on SparkR Reporter: Atul Khairnar Fix For: 0.8.2 Sys.setenv(SPARK_HOME="C:/spark") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/") library(SparkR) sc <- sparkR.session(master = "local") sqlContext <- sparkRSQL.init(sc) o/p: showing error in Rstudio Warning message: 'sparkRSQL.init' is deprecated. Use 'sparkR.session' instead. See help("Deprecated") Can you help me understand what exactly this error/warning is? And next model <- spark.logit(training, Survived ~ ., regParam = 0.5) Error in handleErrors(returnStatus, conn) : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: SPARK_HOME not set. Can't locate SparkR package.
at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87) at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339) at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391) at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69) at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors
[ https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22008: Assignee: Apache Spark > Spark Streaming Dynamic Allocation auto fix maxNumExecutors > --- > > Key: SPARK-22008 > URL: https://issues.apache.org/jira/browse/SPARK-22008 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Yue Ma >Assignee: Apache Spark >Priority: Minor > > In Spark Streaming DRA, the metric we use to add or remove executors is the > ratio of batch processing time / batch duration (R), and we use the parameter > "spark.streaming.dynamicAllocation.maxExecutors" to set the max number of > executors. Currently this doesn't work well with Spark Streaming, for several reasons: > (1) For example, if the max number of executors we need is 10 and we set > "spark.streaming.dynamicAllocation.maxExecutors" to 15, we obviously waste 5 > executors. > (2) If the number of topic partitions changes, then the partitioning of the KafkaRDD, > and hence the number of tasks in a stage, changes too. The max number of executors we need will > also change, so maxExecutors should change with the number of tasks. > The goal of this JIRA is to auto-fix maxNumExecutors: using a SparkListener, > when a stage is submitted, first figure out the number of executors we need, then > update maxNumExecutors.
[jira] [Commented] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors
[ https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166106#comment-16166106 ] Apache Spark commented on SPARK-22008: -- User 'mayuehappy' has created a pull request for this issue: https://github.com/apache/spark/pull/19233 > Spark Streaming Dynamic Allocation auto fix maxNumExecutors > --- > > Key: SPARK-22008 > URL: https://issues.apache.org/jira/browse/SPARK-22008 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Yue Ma >Priority: Minor > > In SparkStreaming DRA .The metric we use to add or remove executor is the > ratio of batch processing time / batch duration (R). And we use the parameter > "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of > executor .Currently it doesn't work well with Spark streaming because of > several reasons: > (1) For example if the max nums of executor we need is 10 and we set > "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 > executors. > (2) If the number of topic partition changes ,then the partition of KafkaRDD > or the num of tasks in a stage changes too.And the max executor we need will > also change,so the num of maxExecutors should change with the nums of Task . > The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner > when Stage Submitted ,first figure out the num executor we need , then > update the maxNumExecutor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors
[ https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22008: Assignee: (was: Apache Spark) > Spark Streaming Dynamic Allocation auto fix maxNumExecutors > --- > > Key: SPARK-22008 > URL: https://issues.apache.org/jira/browse/SPARK-22008 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Yue Ma >Priority: Minor > > In SparkStreaming DRA .The metric we use to add or remove executor is the > ratio of batch processing time / batch duration (R). And we use the parameter > "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of > executor .Currently it doesn't work well with Spark streaming because of > several reasons: > (1) For example if the max nums of executor we need is 10 and we set > "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 > executors. > (2) If the number of topic partition changes ,then the partition of KafkaRDD > or the num of tasks in a stage changes too.And the max executor we need will > also change,so the num of maxExecutors should change with the nums of Task . > The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner > when Stage Submitted ,first figure out the num executor we need , then > update the maxNumExecutor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166097#comment-16166097 ] Hyukjin Kwon commented on SPARK-22010: -- I don't think this is worth fixing for now. The improvement looks quite trivial, and it sounds like we would be reinventing the wheel. Do you know a simple and well-known workaround, or any measurement comparing the custom fix with the current behaviour? Otherwise, I'd close this as {{Won't Fix}}. > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > To convert the timestamp type to Python we use the code > `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % > 1000000)`. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) > {code} > It's slow because: > # we try to get the TZ on every conversion > # we use the replace method > Proposed solution: a custom datetime conversion, moving the TZ calculation to > module level
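The proposal discussed above can be sketched in plain Python. This is an assumption-labelled illustration of the idea (resolve the epoch once at module load, replace the two-step `fromtimestamp(...).replace(...)` with a single `timedelta` addition), not Spark's actual patch; it uses a naive UTC epoch, whereas the real conversion works in local time.

```python
import datetime

# Sketch of the proposed fix (an assumption, not Spark's actual patch):
# compute the epoch once at module import instead of consulting the
# timezone via fromtimestamp() on every conversion, and build the result
# with timedelta arithmetic instead of .replace().
EPOCH = datetime.datetime(1970, 1, 1)  # naive UTC epoch

def from_internal(ts):
    # ts is TimestampType's internal value: microseconds since the epoch
    return EPOCH + datetime.timedelta(microseconds=ts)
```

For UTC values this returns the same datetime as the current two-step conversion while doing one addition per value.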
[jira] [Resolved] (SPARK-22006) date/datetime comparisons should avoid casting
[ https://issues.apache.org/jira/browse/SPARK-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22006. -- Resolution: Invalid Please see https://github.com/apache/spark/blob/183d4cb71fbcbf484fc85d8621e1fe04cbbc8195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L119-L121: {code} // We should cast all relative timestamp/date/string comparison into string comparisons // This behaves as a user would expect because timestamp strings sort lexicographically. // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true {code} I think direct comparison between {{datetime}} and {{date}} is not even allowed in Python itself: {code} >>> import datetime >>> datetime.date(2017, 1, 1) > datetime.datetime(2017, 1, 1) Traceback (most recent call last): File "", line 1, in TypeError: can't compare datetime.datetime to datetime.date {code} > date/datetime comparisons should avoid casting > -- > > Key: SPARK-22006 > URL: https://issues.apache.org/jira/browse/SPARK-22006 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Bridgett >Priority: Minor > Labels: performance > > I believe there's a relatively simple optimisation that can be done here - > comparing timestamps with dates involves a cast whereas comparing with > datetimes avoids this (and pushes the query down into parquet: > {code} > df.filter(df['local_read_at'] > datetime.date('2017-01-01')).count() > {code} > Results in a plan of: > {code} > +- *Filter (isnotnull(local_read_at#324) && (cast(local_read_at#324 > as string) > 2017-01-01)) > +- *FileScan parquet [local_read_at#324] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], > PushedFilters: [IsNotNull(local_read_at)], ReadSchema: > struct > {code} > Whereas: > {code} > df.filter(df['local_read_at'] > datetime.datetime(2017,1,1)).count() > {code} > Results in: > {code} > +- *Filter 
(isnotnull(local_read_at#324) && (local_read_at#324 > 1483228800000000)) > +- *FileScan parquet [local_read_at#324] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], > PushedFilters: [IsNotNull(local_read_at), > GreaterThan(local_read_at,2017-01-01 00:00:00.0)], ReadSchema: > struct > {code}
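A user-side workaround sketch for the plan difference quoted above (an assumption about usage, not a Spark change): promote the `date` literal to a `datetime` before building the filter, so the comparison stays in the timestamp domain and is pushed down as in the second plan.

```python
import datetime

def as_datetime(d):
    # Promote a date to the datetime at midnight; passing this to the
    # filter avoids the string cast that a bare date would trigger.
    return datetime.datetime.combine(d, datetime.time.min)

cutoff = as_datetime(datetime.date(2017, 1, 1))
# cutoff now compares directly with other datetimes, which a date cannot:
assert cutoff == datetime.datetime(2017, 1, 1)
```

e.g. `df.filter(df['local_read_at'] > cutoff)` instead of passing the date itself.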
[jira] [Updated] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors
[ https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Ma updated SPARK-22008: --- Description: In SparkStreaming DRA .The metric we use to add or remove executor is the ratio of batch processing time / batch duration (R). And we use the parameter "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of executor .Currently it doesn't work well with Spark streaming because of several reasons: (1) For example if the max nums of executor we need is 10 and we set "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 executors. (2) If the number of topic partition changes ,then the partition of KafkaRDD or the num of tasks in a stage changes too.And the max executor we need will also change,so the num of maxExecutors should change with the nums of Task . The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner when Stage Submitted ,first figure out the num executor we need , then update the maxNumExecutor was: In SparkStreaming DRA .The metric we use to add or remove executor is the ratio of batch processing time / batch duration (R). And we use the parameter "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of executor .Currently it doesn't work well with Spark streaming because of several reasons: (1) For example if the max nums of executor we need is 10 and we set "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 executors. (2) If the number of topic partition changes ,then the partition of KafkaRDD or the num of tasks in a stage changes too.And the max executor we need will also change,so the num of maxExecutors should change with the nums of Task . The goal of this JIRA is to auto fix maxNumExecutors . 
Using a SparkListener: when a stage is submitted, first figure out the number of executors we need, then update maxNumExecutors. > Spark Streaming Dynamic Allocation auto fix maxNumExecutors > --- > > Key: SPARK-22008 > URL: https://issues.apache.org/jira/browse/SPARK-22008 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Yue Ma >Priority: Minor > > In SparkStreaming DRA .The metric we use to add or remove executor is the > ratio of batch processing time / batch duration (R). And we use the parameter > "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of > executor .Currently it doesn't work well with Spark streaming because of > several reasons: > (1) For example if the max nums of executor we need is 10 and we set > "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 > executors. > (2) If the number of topic partition changes ,then the partition of KafkaRDD > or the num of tasks in a stage changes too.And the max executor we need will > also change,so the num of maxExecutors should change with the nums of Task . > The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner > when Stage Submitted ,first figure out the num executor we need , then > update the maxNumExecutor
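The computation such a stage-submitted listener would run can be sketched in plain Python (the helper name and inputs are hypothetical; Spark's listener API is JVM-side):

```python
import math

def needed_max_executors(num_tasks, cores_per_executor):
    # Hypothetical helper: each core runs one task at a time, so a stage
    # with num_tasks tasks can keep at most this many executors busy.
    # A listener could use this as a dynamic cap instead of the static
    # spark.streaming.dynamicAllocation.maxExecutors setting.
    return math.ceil(num_tasks / cores_per_executor)
```

With 40 tasks on 4-core executors the cap becomes 10, so a static setting of 15 would no longer waste 5 executors.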
[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure
[ https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166086#comment-16166086 ] Nikolaos Tsipas commented on SPARK-21935: - Thanks for the response [~srowen]! The problem only appears when a python UDF is used, the same UDF written in scala doesn't cause any memory issues. However if you are still thinking that the issue is somewhere else in the app what would be the best way to debug it? Focus on spark or yarn? Also, if you can think of any more specific debugging steps please make suggestions. > Pyspark UDF causing ExecutorLostFailure > > > Key: SPARK-21935 > URL: https://issues.apache.org/jira/browse/SPARK-21935 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Nikolaos Tsipas > Labels: pyspark, udf > Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen > Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png > > > Hi, > I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as > follows: > {code} > from pyspark.sql.functions import col, udf > from pyspark.sql.types import StringType > path = 's3://some/parquet/dir/myfile.parquet' > df = spark.read.load(path) > def _test_udf(useragent): > return useragent.upper() > test_udf = udf(_test_udf, StringType()) > df = df.withColumn('test_field', test_udf(col('std_useragent'))) > df.write.parquet('/output.parquet') > {code} > The following config is used in {{spark-defaults.conf}} (using > {{maximizeResourceAllocation}} in EMR) > {code} > ... > spark.executor.instances 4 > spark.executor.cores 8 > spark.driver.memory 8G > spark.executor.memory9658M > spark.default.parallelism64 > spark.driver.maxResultSize 3G > ... 
> {code} > The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, > 15 GiB memory, 160 SSD GB storage > The above example fails every single time with errors like the following: > {code} > 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, > ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure > (executor 10 exited caused by one of the running tasks) Reason: Container > killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical > memory used. Consider boosting spark.yarn.executor.memoryOverhead. > {code} > I tried to increase the {{spark.yarn.executor.memoryOverhead}} to 3000 which > delays the errors but eventually I get them before the end of the job. The > job eventually fails. > !Screen Shot 2017-09-06 at 11.31.31.png|width=800! > If I run the above job in scala everything works as expected (without having > to adjust the memoryOverhead) > {code} > import org.apache.spark.sql.functions.udf > val upper: String => String = _.toUpperCase > val df = spark.read.load("s3://some/parquet/dir/myfile.parquet") > val upperUDF = udf(upper) > val newdf = df.withColumn("test_field", upperUDF(col("std_useragent"))) > newdf.write.parquet("/output.parquet") > {code} > !Screen Shot 2017-09-06 at 11.31.13.png|width=800! > Cpu utilisation is very bad with pyspark > !cpu.png|width=800! > Is this a known bug with pyspark and udfs or is it a matter of bad > configuration? > Looking forward to suggestions. Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure
[ https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166069#comment-16166069 ] Sean Owen commented on SPARK-21935: --- The error is just running out of memory. Generally it is not true that UDFs cause the whole thing to fail. It is likely an issue elsewhere on your app > Pyspark UDF causing ExecutorLostFailure > > > Key: SPARK-21935 > URL: https://issues.apache.org/jira/browse/SPARK-21935 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Nikolaos Tsipas > Labels: pyspark, udf > Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen > Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png > > > Hi, > I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as > follows: > {code} > from pyspark.sql.functions import col, udf > from pyspark.sql.types import StringType > path = 's3://some/parquet/dir/myfile.parquet' > df = spark.read.load(path) > def _test_udf(useragent): > return useragent.upper() > test_udf = udf(_test_udf, StringType()) > df = df.withColumn('test_field', test_udf(col('std_useragent'))) > df.write.parquet('/output.parquet') > {code} > The following config is used in {{spark-defaults.conf}} (using > {{maximizeResourceAllocation}} in EMR) > {code} > ... > spark.executor.instances 4 > spark.executor.cores 8 > spark.driver.memory 8G > spark.executor.memory9658M > spark.default.parallelism64 > spark.driver.maxResultSize 3G > ... 
> {code} > The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, > 15 GiB memory, 160 SSD GB storage > The above example fails every single time with errors like the following: > {code} > 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, > ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure > (executor 10 exited caused by one of the running tasks) Reason: Container > killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical > memory used. Consider boosting spark.yarn.executor.memoryOverhead. > {code} > I tried to increase the {{spark.yarn.executor.memoryOverhead}} to 3000 which > delays the errors but eventually I get them before the end of the job. The > job eventually fails. > !Screen Shot 2017-09-06 at 11.31.31.png|width=800! > If I run the above job in scala everything works as expected (without having > to adjust the memoryOverhead) > {code} > import org.apache.spark.sql.functions.udf > val upper: String => String = _.toUpperCase > val df = spark.read.load("s3://some/parquet/dir/myfile.parquet") > val upperUDF = udf(upper) > val newdf = df.withColumn("test_field", upperUDF(col("std_useragent"))) > newdf.write.parquet("/output.parquet") > {code} > !Screen Shot 2017-09-06 at 11.31.13.png|width=800! > Cpu utilisation is very bad with pyspark > !cpu.png|width=800! > Is this a known bug with pyspark and udfs or is it a matter of bad > configuration? > Looking forward to suggestions. Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure
[ https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166063#comment-16166063 ] Alistair Wooldrige commented on SPARK-21935: I get this a lot, and see it *only* when introducing a Pyspark UDF into a Spark SQL job. The job will run perfectly fine without the UDF, but as soon as one is introduced these {{ExecutorLostFailure}} errors start. This is the case even if the UDF does no significant work and just returns a static string such as: {noformat} def _test_udf(_): return "apple" {noformat} Any suggestions would be greatly appreciated as I've also tried many of the tuning fixes that [~nicktgr15] mentioned, to no avail. > Pyspark UDF causing ExecutorLostFailure > > > Key: SPARK-21935 > URL: https://issues.apache.org/jira/browse/SPARK-21935 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Nikolaos Tsipas > Labels: pyspark, udf > Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen > Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png > > > Hi, > I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as > follows: > {code} > from pyspark.sql.functions import col, udf > from pyspark.sql.types import StringType > path = 's3://some/parquet/dir/myfile.parquet' > df = spark.read.load(path) > def _test_udf(useragent): > return useragent.upper() > test_udf = udf(_test_udf, StringType()) > df = df.withColumn('test_field', test_udf(col('std_useragent'))) > df.write.parquet('/output.parquet') > {code} > The following config is used in {{spark-defaults.conf}} (using > {{maximizeResourceAllocation}} in EMR) > {code} > ... > spark.executor.instances 4 > spark.executor.cores 8 > spark.driver.memory 8G > spark.executor.memory9658M > spark.default.parallelism64 > spark.driver.maxResultSize 3G > ... 
> {code} > The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, > 15 GiB memory, 160 SSD GB storage > The above example fails every single time with errors like the following: > {code} > 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, > ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure > (executor 10 exited caused by one of the running tasks) Reason: Container > killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical > memory used. Consider boosting spark.yarn.executor.memoryOverhead. > {code} > I tried to increase the {{spark.yarn.executor.memoryOverhead}} to 3000 which > delays the errors but eventually I get them before the end of the job. The > job eventually fails. > !Screen Shot 2017-09-06 at 11.31.31.png|width=800! > If I run the above job in scala everything works as expected (without having > to adjust the memoryOverhead) > {code} > import org.apache.spark.sql.functions.udf > val upper: String => String = _.toUpperCase > val df = spark.read.load("s3://some/parquet/dir/myfile.parquet") > val upperUDF = udf(upper) > val newdf = df.withColumn("test_field", upperUDF(col("std_useragent"))) > newdf.write.parquet("/output.parquet") > {code} > !Screen Shot 2017-09-06 at 11.31.13.png|width=800! > Cpu utilisation is very bad with pyspark > !cpu.png|width=800! > Is this a known bug with pyspark and udfs or is it a matter of bad > configuration? > Looking forward to suggestions. Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
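For readers hitting the same container kill discussed in this thread, the YARN message points at off-heap headroom for the Python worker processes. A hedged {{spark-defaults.conf}} fragment (the value is an illustrative assumption, not a recommendation; the reporter notes that 3000 only delayed the failures):

```
# Off-heap room per executor, in MB, for non-JVM memory such as
# the pyspark worker processes that evaluate the UDF
spark.yarn.executor.memoryOverhead   3000
```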
[jira] [Created] (SPARK-22010) Slow fromInternal conversion for TimestampType
Maciej Bryński created SPARK-22010: -- Summary: Slow fromInternal conversion for TimestampType Key: SPARK-22010 URL: https://issues.apache.org/jira/browse/SPARK-22010 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński To convert the timestamp type to Python we use the code `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)`. {code} In [34]: %%timeit ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) ...: 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) {code} It's slow because: # we try to get the TZ on every conversion # we use the replace method Proposed solution: a custom datetime conversion, moving the TZ calculation to module level
[jira] [Updated] (SPARK-22009) Using treeAggregate improve some algs
[ https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22009: - Description: I test on a dataset of about 13M instances, and found that using `treeAggregate` give a speedup in following algs: OneHotEncoder ~ 5% StatFunctions.calculateCov ~ 7% StatFunctions.multipleApproxQuantiles ~ 9% RegressionEvaluator ~ 8% was: I test on a dataset of about 13M instances, and found that using `treeAggregate` give a speedup in following algs: OneHotEncoder ~ 5% StatFunctions.calculateCov ~ 13% StatFunctions.multipleApproxQuantiles ~ 9% RegressionEvaluator ~ 8% > Using treeAggregate improve some algs > - > > Key: SPARK-22009 > URL: https://issues.apache.org/jira/browse/SPARK-22009 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Minor > > I test on a dataset of about 13M instances, and found that using > `treeAggregate` give a speedup in following algs: > OneHotEncoder ~ 5% > StatFunctions.calculateCov ~ 7% > StatFunctions.multipleApproxQuantiles ~ 9% > RegressionEvaluator ~ 8% -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
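As a toy model of what `treeAggregate` changes relative to a flat `aggregate` (a sketch of the idea, not Spark's implementation): per-partition partials are combined pairwise in rounds, like the levels of a reduction tree, instead of being folded one by one on the driver.

```python
from functools import reduce

def tree_aggregate(partitions, zero, seq_op, comb_op):
    # Toy model: aggregate each partition locally first...
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # ...then combine the partial results pairwise, round by round,
    # mimicking the levels of the reduction tree.
    while len(partials) > 1:
        nxt = [comb_op(partials[i], partials[i + 1])
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            nxt.append(partials[-1])  # odd one out rides to the next round
        partials = nxt
    return partials[0] if partials else zero
```

In Spark the pairwise rounds happen on the executors, which is why the speedups above show up once the number of partial results is large.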
[jira] [Assigned] (SPARK-22009) Using treeAggregate improve some algs
[ https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22009: Assignee: Apache Spark > Using treeAggregate improve some algs > - > > Key: SPARK-22009 > URL: https://issues.apache.org/jira/browse/SPARK-22009 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > I test on a dataset of about 13M instances, and found that using > `treeAggregate` give a speedup in following algs: > OneHotEncoder ~ 5% > StatFunctions.calculateCov ~ 13% > StatFunctions.multipleApproxQuantiles ~ 9% > RegressionEvaluator ~ 8% -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22009) Using treeAggregate improve some algs
[ https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166003#comment-16166003 ] Apache Spark commented on SPARK-22009: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/19232 > Using treeAggregate improve some algs > - > > Key: SPARK-22009 > URL: https://issues.apache.org/jira/browse/SPARK-22009 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Minor > > I test on a dataset of about 13M instances, and found that using > `treeAggregate` give a speedup in following algs: > OneHotEncoder ~ 5% > StatFunctions.calculateCov ~ 13% > StatFunctions.multipleApproxQuantiles ~ 9% > RegressionEvaluator ~ 8% -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22009) Using treeAggregate improve some algs
[ https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22009: Assignee: (was: Apache Spark) > Using treeAggregate improve some algs > - > > Key: SPARK-22009 > URL: https://issues.apache.org/jira/browse/SPARK-22009 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Minor > > I test on a dataset of about 13M instances, and found that using > `treeAggregate` give a speedup in following algs: > OneHotEncoder ~ 5% > StatFunctions.calculateCov ~ 13% > StatFunctions.multipleApproxQuantiles ~ 9% > RegressionEvaluator ~ 8% -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22009) Using treeAggregate improve some algs
zhengruifeng created SPARK-22009: Summary: Using treeAggregate improve some algs Key: SPARK-22009 URL: https://issues.apache.org/jira/browse/SPARK-22009 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng Priority: Minor I test on a dataset of about 13M instances, and found that using `treeAggregate` give a speedup in following algs: OneHotEncoder ~ 5% StatFunctions.calculateCov ~ 13% StatFunctions.multipleApproxQuantiles ~ 9% RegressionEvaluator ~ 8% -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors
Yue Ma created SPARK-22008: -- Summary: Spark Streaming Dynamic Allocation auto fix maxNumExecutors Key: SPARK-22008 URL: https://issues.apache.org/jira/browse/SPARK-22008 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.2.0 Reporter: Yue Ma Priority: Minor In SparkStreaming DRA .The metric we use to add or remove executor is the ratio of batch processing time / batch duration (R). And we use the parameter "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of executor .Currently it doesn't work well with Spark streaming because of several reasons: (1) For example if the max nums of executor we need is 10 and we set "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 executors. (2) If the number of topic partition changes ,then the partition of KafkaRDD or the num of tasks in a stage changes too.And the max executor we need will also change,so the num of maxExecutors should change with the nums of Task . The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner when Stage Submitted ,first figure out the num executor we need , then update the maxNumExecutor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22007) spark-submit on yarn or local , got different result
[ https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xinzhang resolved SPARK-22007. -- Resolution: Won't Fix > spark-submit on yarn or local , got different result > > > Key: SPARK-22007 > URL: https://issues.apache.org/jira/browse/SPARK-22007 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: xinzhang > > submit the py script on local. > /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py > result: > ++ > |databaseName| > ++ > | default| > | | > | x| > ++ > submit the py script on yarn. > /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster > test_hive.py > result: > ++ > |databaseName| > ++ > | default| > ++ > the py script : > [yangtt@dc-gateway119 test]$ cat test_hive.py > #!/usr/bin/env python > #coding=utf-8 > from os.path import expanduser, join, abspath > from pyspark.sql import SparkSession > from pyspark.sql import Row > from pyspark.conf import SparkConf > def squared(s): > return s * s > warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table') > spark = SparkSession \ > .builder \ > .appName("Python_Spark_SQL_Hive") \ > .config("spark.sql.warehouse.dir", warehouse_location) \ > .config(conf=SparkConf()) \ > .enableHiveSupport() \ > .getOrCreate() > spark.udf.register("squared",squared) > spark.sql("show databases").show() > Q:why the spark load the different hive metastore > the yarn always use the DERBY? > 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is > DERBY > my current metastore is in mysql. > any suggest will be helpful. > thanks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22007) spark-submit on yarn or local , got different result
[ https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165972#comment-16165972 ] xinzhang commented on SPARK-22007: -- ye .i figure it out. add this with instance sparkSession .config("hive.metastore.uris", "thrift://11.11.11.11:9083") \ maybe the web here should describe more detail. https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder > spark-submit on yarn or local , got different result > > > Key: SPARK-22007 > URL: https://issues.apache.org/jira/browse/SPARK-22007 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: xinzhang > > submit the py script on local. > /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py > result: > ++ > |databaseName| > ++ > | default| > | | > | x| > ++ > submit the py script on yarn. > /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster > test_hive.py > result: > ++ > |databaseName| > ++ > | default| > ++ > the py script : > [yangtt@dc-gateway119 test]$ cat test_hive.py > #!/usr/bin/env python > #coding=utf-8 > from os.path import expanduser, join, abspath > from pyspark.sql import SparkSession > from pyspark.sql import Row > from pyspark.conf import SparkConf > def squared(s): > return s * s > warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table') > spark = SparkSession \ > .builder \ > .appName("Python_Spark_SQL_Hive") \ > .config("spark.sql.warehouse.dir", warehouse_location) \ > .config(conf=SparkConf()) \ > .enableHiveSupport() \ > .getOrCreate() > spark.udf.register("squared",squared) > spark.sql("show databases").show() > Q:why the spark load the different hive metastore > the yarn always use the DERBY? > 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is > DERBY > my current metastore is in mysql. > any suggest will be helpful. > thanks. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165946#comment-16165946 ]

Daniel Darabos commented on SPARK-21418:
----------------------------------------

Sean's fix should cover you no matter what triggers the unexpected {{toString}} call. You could try building from {{master}} (or taking a nightly from https://spark.apache.org/developer-tools.html#nightly-builds) to confirm that this is the case.

> NoSuchElementException: None.get in DataSourceScanExec with
> sun.io.serialization.extendedDebugInfo=true
> -----------------------------------------------------------
>
>                 Key: SPARK-21418
>                 URL: https://issues.apache.org/jira/browse/SPARK-21418
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Daniel Darabos
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.2.1, 2.3.0
>
> I don't have a minimal reproducible example yet, sorry. I have the following lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
> 	at scala.None$.get(Option.scala:347)
> 	at scala.None$.get(Option.scala:345)
> 	at org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
> 	at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
> 	at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> 	at org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
> 	at org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
> 	at org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
> 	at org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
> 	at org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> 	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> 	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> 	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> 	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
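The failure pattern in the trace above can be mimicked in a few lines. This is a loose Python analogue, not Spark's Scala code, and the "secret" key is purely hypothetical: the point is that {{redact}} unconditionally unwrapped an optional session conf (Option.get on None), which blows up when {{toString}} fires outside a query context, exactly what -Dsun.io.serialization.extendedDebugInfo=true provokes during task serialization.

```python
from typing import Optional

def redact_buggy(session_conf: Optional[dict], text: str) -> str:
    """Analogue of the failing path: assume a session conf is always
    present, like Option.get on None in the stack trace above."""
    if session_conf is None:
        raise LookupError("None.get")  # roughly what scala.None$.get throws
    secret = session_conf.get("secret")  # "secret" is a made-up key for illustration
    return text.replace(secret, "*REDACTED*") if secret else text

def redact_fixed(session_conf: Optional[dict], text: str) -> str:
    """Sketch of the shape of a fix: tolerate a missing session conf
    instead of assuming one is always available on this code path."""
    if session_conf is None:
        return text  # no conf to redact against; don't crash
    secret = session_conf.get("secret")
    return text.replace(secret, "*REDACTED*") if secret else text
```

One common way the debug flag gets enabled for the driver JVM is via spark-submit, e.g. --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true", which makes ObjectOutputStream call toString on objects along the serialization path.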
[jira] [Updated] (SPARK-22007) spark-submit on yarn or local , got different result
[ https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xinzhang updated SPARK-22007:
-----------------------------
    Description: 
submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
|            |
|           x|
+------------+

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
+------------+

the py script:
[yangtt@dc-gateway119 test]$ cat test_hive.py
#!/usr/bin/env python
#coding=utf-8
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
    return s * s

warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
    .builder \
    .appName("Python_Spark_SQL_Hive") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config(conf=SparkConf()) \
    .enableHiveSupport() \
    .getOrCreate()

spark.udf.register("squared", squared)
spark.sql("show databases").show()

Q: why does Spark load a different Hive metastore? The YARN run always uses Derby:
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
My current metastore is in MySQL. Any suggestion would be helpful. Thanks.

  was:
submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
|            |
|           x|
+------------+

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
+------------+

the py script:
[yangtt@dc-gateway119 test]$ cat test_hive.py
#!/usr/bin/env python
#coding=utf-8
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
    return s * s

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
    .builder \
    .appName("Python_Spark_SQL_Hive") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config(conf=SparkConf()) \
    .enableHiveSupport() \
    .getOrCreate()

spark.udf.register("squared", squared)
spark.sql("show databases").show()

Q: why does Spark load a different Hive metastore? The YARN run always uses Derby:
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
My current metastore is in MySQL. Any suggestion would be helpful. Thanks.


> spark-submit on yarn or local , got different result
> ----------------------------------------------------
>
>                 Key: SPARK-22007
>                 URL: https://issues.apache.org/jira/browse/SPARK-22007
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.1.0
>            Reporter: xinzhang
>
> submit the py script on local.
> /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
> result:
> +------------+
> |databaseName|
> +------------+
> |     default|
> |            |
> |           x|
> +------------+
>
> submit the py script on yarn.
> /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster test_hive.py
> result:
> +------------+
> |databaseName|
> +------------+
> |     default|
> +------------+
>
> the py script:
> [yangtt@dc-gateway119 test]$ cat test_hive.py
> #!/usr/bin/env python
> #coding=utf-8
> from os.path import expanduser, join, abspath
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.conf import SparkConf
>
> def squared(s):
>     return s * s
>
> warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')
>
> spark = SparkSession \
>     .builder \
>     .appName("Python_Spark_SQL_Hive") \
>     .config("spark.sql.warehouse.dir", warehouse_location) \
>     .config(conf=SparkConf()) \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> spark.udf.register("squared", squared)
> spark.sql("show databases").show()
>
> Q: why does Spark load a different Hive metastore? The YARN run always uses Derby:
> 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
> My current metastore is in MySQL. Any suggestion would be helpful. Thanks.
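One quick way to confirm which metastore backend a given run actually used is to look for the MetaStoreDirectSql line quoted in the report. A small helper could automate that check; the regex below is derived only from the single log line shown here, so it may need adjusting for other Hive versions:

```python
import re

# Matches e.g. "... INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY"
_BACKEND = re.compile(r"MetaStoreDirectSql: Using direct SQL, underlying DB is (\S+)")

def metastore_backend(log_lines):
    """Return the metastore database name (e.g. DERBY or MYSQL) from Hive's
    MetaStoreDirectSql startup line, or None if the line never appears."""
    for line in log_lines:
        m = _BACKEND.search(line)
        if m:
            return m.group(1)
    return None

log = ["17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY"]
print(metastore_backend(log))  # prints DERBY
```

Seeing DERBY here on the YARN run, while the gateway host talks to MySQL, is the symptom that the cluster-mode driver never received the metastore configuration.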
[jira] [Created] (SPARK-22007) spark-submit on yarn or local , got different result
xinzhang created SPARK-22007:
--------------------------------

             Summary: spark-submit on yarn or local , got different result
                 Key: SPARK-22007
                 URL: https://issues.apache.org/jira/browse/SPARK-22007
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Spark Shell, Spark Submit
    Affects Versions: 2.1.0
            Reporter: xinzhang


submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
|            |
|           x|
+------------+

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster test_hive.py
result:
+------------+
|databaseName|
+------------+
|     default|
+------------+

the py script:
[yangtt@dc-gateway119 test]$ cat test_hive.py
#!/usr/bin/env python
#coding=utf-8
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
    return s * s

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
    .builder \
    .appName("Python_Spark_SQL_Hive") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config(conf=SparkConf()) \
    .enableHiveSupport() \
    .getOrCreate()

spark.udf.register("squared", squared)
spark.sql("show databases").show()

Q: why does Spark load a different Hive metastore? The YARN run always uses Derby:
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
My current metastore is in MySQL. Any suggestion would be helpful. Thanks.