[jira] [Updated] (SPARK-23890) Support DDL for adding nested columns to struct types
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto updated SPARK-23890:

Description:

As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE CHANGE COLUMN commands to Hive. This restriction was loosened in [https://github.com/apache/spark/pull/12714] to allow those commands if they only change the column comment.

Wikimedia has been evolving Parquet-backed Hive tables, whose data originally comes from JSON events, by adding newly found columns to the Hive table schema via a Spark job we call 'Refine'. We do this by recursively merging an input DataFrame schema with the Hive table's DataFrame schema, finding new fields, and then issuing an ALTER TABLE statement to add the columns. However, because we allow nested data types in the incoming JSON data, we make extensive use of struct type fields. To add a newly detected field to a nested data type, we must alter the struct column and append the nested struct field. This requires a CHANGE COLUMN that alters the column type. In reality, the type of the column is not changing; a new field is just being added to the struct, but to SQL this looks like a type change.

-We were about to upgrade to Spark 2, but this new restriction on the SQL DDL that can be sent to Hive will block us. I believe this is fixable by adding an exception in [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and destination type are both struct types and the destination type only adds new fields.-

In this [PR|https://github.com/apache/spark/pull/21012], I was told that the Spark 3 DataSource v2 would support this. However, it is clear that it does not. There is an [explicit check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441] and a [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583] that prevent this from happening. This can be done via {{ALTER TABLE ADD COLUMN nested1.new_field1}}, but that is not supported for any DataSource v1 sources.

(was: the same description without the DataSource v2 notes)

> Support DDL for adding nested columns to struct types
> -
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 3.0.0
> Reporter: Andrew Otto
> Priority: Major
> Labels: bulk-closed, pull-request-available
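The recursive schema merge described above can be sketched in plain Python (a minimal illustration, not Wikimedia's actual Refine code; struct types are modeled as dicts mapping field name to type, with nested dicts for nested structs):

```python
def find_new_fields(table_schema, input_schema, prefix=""):
    """Return (dotted_path, type) pairs for fields present in the input
    schema but missing from the table schema."""
    new_fields = []
    for name, dtype in input_schema.items():
        path = f"{prefix}{name}"
        if name not in table_schema:
            new_fields.append((path, dtype))
        elif isinstance(dtype, dict) and isinstance(table_schema[name], dict):
            # Recurse into struct fields: a new field found here is what
            # forces a CHANGE COLUMN on the whole struct column.
            new_fields.extend(
                find_new_fields(table_schema[name], dtype, path + ".")
            )
    return new_fields

table = {"id": "bigint", "meta": {"dt": "string"}}
incoming = {"id": "bigint", "meta": {"dt": "string", "domain": "string"}}
print(find_new_fields(table, incoming))  # [('meta.domain', 'string')]
```

The diff result is exactly the list of nested fields that, pre-DataSource-v2, could only be added by re-stating the struct's entire type in a CHANGE COLUMN.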
[jira] [Updated] (SPARK-23890) Support DDL for adding nested columns to struct types
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto updated SPARK-23890:

Summary: Support DDL for adding nested columns to struct types (was: Hive ALTER TABLE CHANGE COLUMN for struct type no longer works)

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto reopened SPARK-23890:

This is fixed for DataSource v2 via {{alter table add column nested.new_field0}}, but apparently few data sources use the DataSource v2 code path. Iceberg works, but Hive, Parquet, ORC, and JSON still use the DataSource v1 code path to check whether this is allowed. Reopening and retitling more generically to allow nested column addition for Parquet, etc.
[jira] [Resolved] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto resolved SPARK-23890.

Fix Version/s: 3.0.0
Resolution: Fixed

Ah! This is supported in DataSource v2 after all, just not via CHANGE COLUMN. Instead, you can add a column to a nested field by addressing it with dotted notation:

ALTER TABLE otto.test_table03 ADD COLUMN s1.s1_f2_added STRING;
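Under that dotted notation, newly found nested fields map directly onto ADD COLUMN statements. A hypothetical helper (table and field names are made up for illustration; it takes the (dotted_path, type) pairs a schema diff would produce):

```python
def add_column_ddl(table, new_fields):
    """Build DataSource-v2-style ADD COLUMN statements for nested fields,
    given (dotted_path, type) pairs from a schema diff."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {path} {dtype.upper()}"
        for path, dtype in new_fields
    ]

print(add_column_ddl("otto.test_table03", [("s1.s1_f2_added", "string")]))
# ['ALTER TABLE otto.test_table03 ADD COLUMN s1.s1_f2_added STRING']
```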
[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto updated SPARK-23890:

Shepherd: Max Gekk
Affects Version/s: 3.0.0
Description: (updated to add the DataSource v2 notes; full text as above)
[jira] [Reopened] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto reopened SPARK-23890:

This was supposed to have been fixed in Spark 3 datasource v2, but the issue persists.
[jira] [Comment Edited] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716135#comment-17716135 ] Andrew Grigorev edited comment on SPARK-43273 at 4/25/23 12:56 PM:

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by default for its Parquet output format :). https://github.com/ClickHouse/ClickHouse/issues/49141

> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
> Reporter: Andrew Grigorev
> Priority: Trivial
>
> The parquet-hadoop version should be updated to 1.13.0 (together with the other parquet-mr libs):
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
> at java.base/java.lang.Enum.valueOf(Enum.java:273)
> at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
> at org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}
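The stack trace above is a plain enum-lookup miss: the older parquet-mr bundled with Spark has no LZ4_RAW member in its CompressionCodecName enum, so Enum.valueOf throws when the reader decodes the file footer. A rough Python analogue (the member list here is hypothetical, for illustration only):

```python
from enum import Enum

class CompressionCodecName(Enum):
    # Stand-in for parquet-mr's codec enum as it existed before LZ4_RAW
    # support was added; member list is illustrative, not exhaustive.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZ4 = 3
    ZSTD = 4

def codec_from_footer(name):
    """Mimic Java's Enum.valueOf: look the codec name up, fail if unknown."""
    try:
        return CompressionCodecName[name]
    except KeyError:
        raise ValueError(f"No enum constant CompressionCodecName.{name}")

print(codec_from_footer("SNAPPY"))
# codec_from_footer("LZ4_RAW") raises, just as the older reader does:
# the codec id from the footer maps onto no known enum member.
```

This is why the fix is purely a dependency bump: a newer parquet-mr simply ships the missing member.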
[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43273:

Description: (added "(together with other parquet-mr libs)" to the first line; full text and stack trace as above)
[jira] [Commented] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716135#comment-17716135 ] Andrew Grigorev commented on SPARK-43273:

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by default for its Parquet output format :).
[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43273:

Description: (added the stack trace shown above to the one-line description)
[jira] [Created] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
Andrew Grigorev created SPARK-43273:

Summary: Spark can't read parquet files with a newer LZ4_RAW compression
Key: SPARK-43273
URL: https://issues.apache.org/jira/browse/SPARK-43273
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0, 3.3.2, 3.2.4, 3.3.3
Reporter: Andrew Grigorev

The parquet-hadoop version should be updated to 1.13.0.
[jira] [Updated] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
[ https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43189: Description: h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet taken from [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("col1 string, col2 long") def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame: s3['col2'] = s1 + s2.str.len() return s3 {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. was: h2. 
Issue
Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{# type: ignore[call-overload]}}, but this is not an ideal solution.
h2. Example
Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"] * len(s), index=s.index)
{code}
Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it.
h2. Proposed Solution
We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors.
h2. Impact
By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs.
> No overload variant of "pandas_udf" matches argument type "str"
> ---
>
> Key: SPARK-43189
> URL: https://issues.apache.org/jira/browse/SPARK-43189
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Reporter: Andrew Grigorev
> Priority: Major
>
> h2.
Issue
> Users who have mypy enabled in their IDE or CI environment face very verbose
> error messages when using the {{pandas_udf}} function in PySpark. The current
> typing of the {{pandas_udf}} function seems to be causing these issues. As a
> workaround, the official documentation provides examples that use {{# type: ignore[call-overload]}}, but this is not an ideal solution.
> h2. Example
> Here's a code snippet taken from the
> [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs]
> that triggers the error when mypy is enabled:
> {code:python}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> @pandas_udf("col1 string, col2 long")
> def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) ->
[jira] [Created] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
Andrew Grigorev created SPARK-43189: --- Summary: No overload variant of "pandas_udf" matches argument type "str" Key: SPARK-43189 URL: https://issues.apache.org/jira/browse/SPARK-43189 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0, 3.3.2, 3.2.4 Reporter: Andrew Grigorev
h2. Issue
Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{# type: ignore[call-overload]}}, but this is not an ideal solution.
h2. Example
Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"] * len(s), index=s.index)
{code}
Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it.
h2. Proposed Solution
We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors.
h2. Impact
By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs.
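A minimal, PySpark-free sketch of why this happens (the decorator name and signatures below are hypothetical stand-ins, not PySpark's actual stubs): when a decorator factory is typed with several {{@overload}} variants and a call matches none of them, mypy rejects the call and lists every candidate overload, which is what produces the verbose "No overload variant ... matches argument type" output:

```python
from typing import Any, Callable, overload

# Two narrow overloads, neither of which accepts a plain str first argument --
# analogous to stubs that only accept specific callable shapes.
@overload
def toy_udf(f: Callable[[int], int], return_type: str) -> Callable[[int], int]: ...
@overload
def toy_udf(f: Callable[[str], str], return_type: str) -> Callable[[str], str]: ...

def toy_udf(f: Any = None, return_type: str = "string") -> Any:
    # Runtime behavior is trivial; the friction is purely in the type stubs.
    if callable(f):
        return f
    # Called as @toy_udf("string"): return a pass-through decorator instead.
    return lambda func: func

# mypy reports "No overload variant of 'toy_udf' matches argument type 'str'"
# here, even though the decorator works fine at runtime -- hence the
# ignore comment, mirroring the workaround in the Spark docs.
@toy_udf("string")  # type: ignore[call-overload]
def double(x: int) -> int:
    return x * 2

print(double(21))  # 42
```

The fix requested in the ticket amounts to adding (or widening) an overload that accepts the schema string as the first positional argument, so decorator-style usage type-checks without the ignore comment.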
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39897) StackOverflowError in TaskMemoryManager
Andrew Ray created SPARK-39897: -- Summary: StackOverflowError in TaskMemoryManager Key: SPARK-39897 URL: https://issues.apache.org/jira/browse/SPARK-39897 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.7 Reporter: Andrew Ray I have observed the following error that looks to stem from TaskMemoryManager.allocatePage making a recursive call to itself when a page can not be allocated. I'm observing this in Spark 2.4 but since the relevant code is still the same in master this is likely still a potential point of failure in current versions. Prioritizing this as minor as this looks to be a very uncommon outcome as I can not find any other reports of a similar nature. {code:java} Py4JJavaError: An error occurred while calling o625.saveAsTable. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.StackOverflowError at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012) at java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535) at java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:457) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:398) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:512) at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2657) at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436) at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370) at java.util.ResourceBundle.getBundle(ResourceBundle.java:899) at sun.util.resources.LocaleData$1.run(LocaleData.java:167)
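The failure mode described above (allocatePage retrying by calling itself) can be sketched in a language-agnostic way. This is an illustrative toy, not Spark's actual TaskMemoryManager code: a recursive retry consumes one stack frame per failed attempt and eventually overflows, while the equivalent iterative retry loop uses constant stack:

```python
import sys

def allocate_recursive(pool, size):
    # Risky pattern: on failure, "spill" to reclaim memory and recurse.
    if pool["free"] >= size:
        pool["free"] -= size
        return size
    pool["free"] += 1  # pretend spilling reclaimed one page
    return allocate_recursive(pool, size)  # one stack frame per retry

def allocate_iterative(pool, size):
    # Same retry logic as a loop: bounded stack however many retries occur.
    while pool["free"] < size:
        pool["free"] += 1
    pool["free"] -= size
    return size

limit = sys.getrecursionlimit()
sys.setrecursionlimit(200)  # make the overflow cheap to demonstrate
try:
    allocate_recursive({"free": 0}, 1000)
    outcome = "succeeded"
except RecursionError:  # Python's analogue of java.lang.StackOverflowError
    outcome = "overflowed"
finally:
    sys.setrecursionlimit(limit)

print("recursive retry:", outcome)                                # overflowed
print("iterative retry:", allocate_iterative({"free": 0}, 1000))  # 1000
```

This is why the report flags the recursive call as a potential point of failure: the depth is bounded only by how many times allocation can fail in a row.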
[jira] [Created] (SPARK-39883) Add DataFrame function parity check
Andrew Ray created SPARK-39883: -- Summary: Add DataFrame function parity check Key: SPARK-39883 URL: https://issues.apache.org/jira/browse/SPARK-39883 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Andrew Ray Add a test that compares the available list of DataFrame functions in org.apache.spark.sql.functions with the SQL function registry.
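The core of such a parity check is a set difference over function names. A minimal sketch with stand-in name lists (in Spark the test would obtain these by reflecting over org.apache.spark.sql.functions and the FunctionRegistry, which this toy does not do):

```python
# Stand-in name sets; a real test would populate these from the two APIs.
dataframe_functions = {"abs", "acos", "add_months", "array", "asc"}
sql_registry_functions = {"abs", "acos", "add_months", "array", "asc", "aggregate"}

# Functions present on one side but not the other; a parity test would
# assert these against an explicit, reviewed ignore-list so that any new
# discrepancy fails the build.
missing_in_dataframe_api = sorted(sql_registry_functions - dataframe_functions)
missing_in_sql_registry = sorted(dataframe_functions - sql_registry_functions)

print(missing_in_dataframe_api)  # ['aggregate']
print(missing_in_sql_registry)   # []
```

Keeping the expected-difference list explicit in the test makes intentional gaps (e.g. SQL-only expressions) visible and auditable.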
[jira] [Created] (SPARK-39734) Add call_udf to pyspark.sql.functions
Andrew Ray created SPARK-39734: -- Summary: Add call_udf to pyspark.sql.functions Key: SPARK-39734 URL: https://issues.apache.org/jira/browse/SPARK-39734 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Andrew Ray Add the call_udf function to PySpark for parity with the Scala/Java function org.apache.spark.sql.functions#call_udf
[jira] [Created] (SPARK-39733) Add map_contains_key to pyspark.sql.functions
Andrew Ray created SPARK-39733: -- Summary: Add map_contains_key to pyspark.sql.functions Key: SPARK-39733 URL: https://issues.apache.org/jira/browse/SPARK-39733 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Andrew Ray SPARK-37584 added the function map_contains_key to SQL and Scala/Java functions. This JIRA is to track its addition to the PySpark function set for parity.
[jira] [Updated] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's
[ https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ray updated SPARK-39728: --- Priority: Minor (was: Major) > Test for parity of SQL functions between Python and JVM DataFrame API's > --- > > Key: SPARK-39728 > URL: https://issues.apache.org/jira/browse/SPARK-39728 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Andrew Ray >Priority: Minor > > Add a unit test that compares the available list of Python DataFrame > functions in pyspark.sql.functions with those available in the Scala/Java > DataFrame API in org.apache.spark.sql.functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's
Andrew Ray created SPARK-39728: -- Summary: Test for parity of SQL functions between Python and JVM DataFrame API's Key: SPARK-39728 URL: https://issues.apache.org/jira/browse/SPARK-39728 Project: Spark Issue Type: Improvement Components: PySpark, Tests Affects Versions: 3.3.0 Reporter: Andrew Ray Add a unit test that compares the available list of Python DataFrame functions in pyspark.sql.functions with those available in the Scala/Java DataFrame API in org.apache.spark.sql.functions.
[jira] [Created] (SPARK-38618) Implement JDBCDataSourceV2
Andrew Murphy created SPARK-38618: - Summary: Implement JDBCDataSourceV2 Key: SPARK-38618 URL: https://issues.apache.org/jira/browse/SPARK-38618 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.1 Reporter: Andrew Murphy
[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql
[ https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498668#comment-17498668 ] Andrew Murphy commented on SPARK-38288: --- Hi [~llozano] I believe this is because JDBC DataSource V2 has not been fully implemented. Even though [29695|https://github.com/apache/spark/pull/29695] has merged, reading from a JDBC connection still defaults to JDBC DataSource V1. > Aggregate push down doesnt work using Spark SQL jdbc datasource with > postgresql > --- > > Key: SPARK-38288 > URL: https://issues.apache.org/jira/browse/SPARK-38288 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.2.1 >Reporter: Luis Lozano Coira >Priority: Major > Labels: DataSource, Spark-SQL > > I am establishing a connection with postgresql using the Spark SQL jdbc > datasource. I have started the spark shell including the postgres driver and > I can connect and execute queries without problems. I am using this statement: > {code:java} > val df = spark.read.format("jdbc").option("url", > "jdbc:postgresql://host:port/").option("driver", > "org.postgresql.Driver").option("dbtable", "test").option("user", > "postgres").option("password", > "***").option("pushDownAggregate",true).load() > {code} > I am adding the pushDownAggregate option because I would like the > aggregations are delegated to the source. But for some reason this is not > happening. > Reviewing this pull request, it seems that this feature should be merged into > 3.2. [https://github.com/apache/spark/pull/29695] > I am making the aggregations considering the mentioned limitations. 
An > example case where I don't see pushdown being done would be this one: > {code:java} > df.groupBy("name").max("age").show() > {code} > The results of the queryExecution are shown below: > {code:java} > scala> df.groupBy("name").max("age").queryExecution.executedPlan > res19: org.apache.spark.sql.execution.SparkPlan = > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, > max(age)#544]) >+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205] > +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], > output=[name#274, max#548]) > +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] > PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: > struct > scala> dfp.groupBy("name").max("age").queryExecution.toString > res20: String = > "== Parsed Logical Plan == > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#246] JDBCRelation(test) [numPartitions=1] > == Analyzed Logical Plan == > name: string, max(age): int > Aggregate [name#274], [name#274, max(age#246) AS max(age)#581] > +- Relation [age#24... > {code} > What could be the problem? Should pushDownAggregate work in this case? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
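A practical way to confirm the behavior reported above is to inspect the plan text for the pushdown markers the JDBC scan prints. The helper below is an illustrative utility (an assumption, not a Spark API) that checks whether the {{PushedAggregates}} list in an explain string is non-empty; the plan quoted in the issue shows {{PushedAggregates: []}}, i.e. nothing was pushed to PostgreSQL:

```python
import re

def aggregates_pushed_down(plan_text: str) -> bool:
    # The JDBC scan node reports pushdown as "PushedAggregates: [ ... ]";
    # an empty bracket list means the aggregate stayed in Spark.
    m = re.search(r"PushedAggregates:\s*\[([^\]]*)\]", plan_text)
    return bool(m and m.group(1).strip())

# Excerpt of the scan line from the plan quoted in the issue.
plan = ("+- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] "
        "PushedAggregates: [], PushedFilters: [], PushedGroupby: []")
print(aggregates_pushed_down(plan))  # False: the aggregate stayed in Spark
```

Scripting this kind of check against {{queryExecution.executedPlan.toString}} makes it easy to verify whether a given option actually changed the physical plan.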
[jira] [Updated] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-38296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Murphy updated SPARK-38296: -- Description: While trying to migrate invalidLiteralForWindowDurationError to use Error Classes (SPARK-38110), I realized that the error class is not propagating correctly. This is because the AnalysisException is caught and rethrown at [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154] without a case for Error Classes. I will attach a fix for this specific error here to avoid bundling too many changes into one PR for SPARK-38110. I anticipate that similar changes will need to be made where AnalysisException is rethrown as the migration continues. (was: While trying to migrate invalidLiteralForWindowDurationError to use Error Classes (SPARK-38110), I realized that the error class is not propagating correctly. This is because the AnalysisException is caught and rethrown at [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154] without a case for Error Classes. I will attach a fix for this specific error here to avoid bundling too many changes into one PR. I anticipate that similar changes will need to be made where AnalysisException is rethrown as the migration continues.) > Error Classes AnalysisException is not propagated in FunctionRegistry > - > > Key: SPARK-38296 > URL: https://issues.apache.org/jira/browse/SPARK-38296 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Andrew Murphy >Priority: Major > > While trying to migrate invalidLiteralForWindowDurationError to use Error > Classes (SPARK-38110), I realized that the error class is not propagating > correctly. 
This is because the AnalysisException is caught and rethrown at > [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154] > without a case for Error Classes. I will attach a fix for this specific > error here to avoid bundling too many changes into one PR for SPARK-38110. I > anticipate that similar changes will need to be made where AnalysisException > is rethrown as the migration continues. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-38296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Murphy updated SPARK-38296: -- External issue ID: (was: SPARK-38110) External issue URL: (was: https://issues.apache.org/jira/browse/SPARK-38110) > Error Classes AnalysisException is not propagated in FunctionRegistry > - > > Key: SPARK-38296 > URL: https://issues.apache.org/jira/browse/SPARK-38296 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Andrew Murphy >Priority: Major > > While trying to migrate invalidLiteralForWindowDurationError to use Error > Classes (SPARK-38110), I realized that the error class is not propagating > correctly. This is because the AnalysisException is caught and rethrown at > [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154] > without a case for Error Classes. I will attach a fix for this specific > error here to avoid bundling too many changes into one PR. I anticipate that > similar changes will need to be made where AnalysisException is rethrown as > the migration continues. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry
Andrew Murphy created SPARK-38296: - Summary: Error Classes AnalysisException is not propagated in FunctionRegistry Key: SPARK-38296 URL: https://issues.apache.org/jira/browse/SPARK-38296 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Andrew Murphy While trying to migrate invalidLiteralForWindowDurationError to use Error Classes (SPARK-38110), I realized that the error class is not propagating correctly. This is because the AnalysisException is caught and rethrown at [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154] without a case for Error Classes. I will attach a fix for this specific error here to avoid bundling too many changes into one PR. I anticipate that similar changes will need to be made where AnalysisException is rethrown as the migration continues. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37954) old columns should not be available after select or drop
[ https://issues.apache.org/jira/browse/SPARK-37954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490630#comment-17490630 ] Andrew Murphy commented on SPARK-37954: --- Hi all, Super excited to report on this one; this is my first time attempting to contribute to Spark, so bear with me! It looks like df.drop("id") should not raise an error per the [documentation|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.drop.html] ("This is a no-op if schema doesn’t contain the given column name(s)."), so this is intended behavior. I was sure Filter was a bug, though. Normally, filtering on a nonexistent column _does_ throw an AnalysisException, but in a very specific case Catalyst actually propagates the reference that is about to be removed.
*Code*
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col as col

spark = SparkSession.builder.appName('available_columns').getOrCreate()
df = spark.createDataFrame([{"oldcol": 1}, {"oldcol": 2}])
df = df.withColumnRenamed('oldcol', 'newcol')
df = df.filter(col("oldcol")!=2)
df.explain("extended")
{code}
*Output*
{code:java}
+------+
|newcol|
+------+
|     1|
+------+

== Parsed Logical Plan ==
'Filter NOT ('oldcol = 2)
+- Project [oldcol#6L AS newcol#8L]
   +- LogicalRDD [oldcol#6L], false

== Analyzed Logical Plan ==
newcol: bigint
Project [newcol#8L]
+- Filter NOT (oldcol#6L = cast(2 as bigint))
   +- Project [oldcol#6L AS newcol#8L, oldcol#6L]
      +- LogicalRDD [oldcol#6L], false

== Optimized Logical Plan ==
Project [oldcol#6L AS newcol#8L]
+- Filter (isnotnull(oldcol#6L) AND NOT (oldcol#6L = 2))
   +- LogicalRDD [oldcol#6L], false

== Physical Plan ==
*(1) Project [oldcol#6L AS newcol#8L]
+- *(1) Filter (isnotnull(oldcol#6L) AND NOT (oldcol#6L = 2))
   +- *(1) Scan ExistingRDD[oldcol#6L]
{code}
As you can see, in the Analysis step Catalyst propagates oldcol#6L to the Project operator for seemingly no reason.
Well, it turns out the reason is actually SPARK-24781 ([PR|https://github.com/apache/spark/pull/21745]), where Analysis was failing on this construction:
{code:scala}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
{code}
{noformat}
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
+- AnalysisBarrier
   +- Project [name#5]
      +- Project [_1#2 AS name#5, _2#3 AS id#6]
         +- LocalRelation [_1#2, _2#3]
{noformat}
So it looks like this PR was created to propagate references in the very specific case of a Filter after a Project, where we filter and select in the same step, but unfortunately we don't differentiate between
{code:java}
df.select(df("name")).filter(df("id") === 0).show()
{code}
and
{code:java}
df = df.select(df("name"))
df = df.filter(df("id") === 0).show()
{code}
[~viirya] I see you were the one who fixed this problem in the first place. Any thoughts about how this could be solved? Unfortunately, creating a fix for this one is beyond my capabilities right now, but I'm trying to learn!
> old columns should not be available after select or drop
> ---
>
> Key: SPARK-37954
> URL: https://issues.apache.org/jira/browse/SPARK-37954
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.0.1
> Reporter: Jean Bon
> Priority: Major
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col as col
> spark = SparkSession.builder.appName('available_columns').getOrCreate()
> df = spark.range(5).select((col("id")+10).alias("id2"))
> assert df.columns==["id2"] #OK
> try:
>     df.select("id")
>     error_raise = False
> except:
>     error_raise = True
> assert error_raise #OK
> df = df.drop("id") #should raise an error
> df.filter(col("id")!=2).count() #returns 4, should raise an error
> {code}
[jira] [Created] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers
Andrew Davidson created SPARK-37295: --- Summary: illegal reflective access operation has occurred; Please consider reporting this to the maintainers Key: SPARK-37295 URL: https://issues.apache.org/jira/browse/SPARK-37295 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.2 Environment: MacBook Pro running macOS 11.6, spark-3.1.2-bin-hadoop3.2. It is not clear to me how Spark finds Java. I believe I also have Java 8 installed somewhere.
{noformat}
$ which java
~/anaconda3/envs/extraCellularRNA/bin/java
$ java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
{noformat}
Reporter: Andrew Davidson
{code:python}
spark = SparkSession\
    .builder\
    .appName("TestEstimatedScalingFactors")\
    .getOrCreate()
{code}
generates the following warning:
{noformat}
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{noformat}
I am using PySpark from spark-3.1.2-bin-hadoop3.2 on a MacBook Pro running macOS 11.6. My small unit test seems to work okay; however, it fails when I try to run on 3.2.0. Any idea how I can track down this issue? Kind regards, Andy
[jira] [Commented] (SPARK-36649) Support Trigger.AvailableNow on Kafka data source
[ https://issues.apache.org/jira/browse/SPARK-36649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411256#comment-17411256 ] Andrew Olson commented on SPARK-36649: -- [~kabhwan] I am not sure I fully understand the details of this. I only have experience with using a Kafka source in batch mode ([https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries]) and minimal familiarity with most of the Spark code. > Support Trigger.AvailableNow on Kafka data source > - > > Key: SPARK-36649 > URL: https://issues.apache.org/jira/browse/SPARK-36649 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > SPARK-36533 introduces a new trigger Trigger.AvailableNow, but only > introduces the new functionality to the file stream source. Given that Kafka > data source is the one of major data sources being used in streaming query, > we should make Kafka data source support this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36576) Improve range split calculation for Kafka Source minPartitions option
[ https://issues.apache.org/jira/browse/SPARK-36576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Olson updated SPARK-36576: - Description: While the [documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] does contain a clear disclaimer, {quote}Please note that this configuration is like a {{hint}}: the number of Spark tasks will be *approximately* {{minPartitions}}. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data. {quote} there are cases where the calculated Kafka partition range splits can differ greatly from expectations. For evenly distributed data and most {{minPartitions}} values this would not be a major or commonly encountered concern. However when the distribution of data across partitions is very heavily skewed, somewhat surprising range split calculations can result. For example, given the following input data: * 1 partition containing 10,000 messages * 1,000 partitions each containing 1 message Spark processing code loading from this collection of 1,001 partitions may decide that it would like each task to read no more than 1,000 messages. Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting the single large partition to be split into 10 equal chunks, along with the 1,000 small partitions each having their own task. That is far from what actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting the large partition into 918 chunks of 10 or 11 messages, two orders of magnitude from the desired maximum message count per task and nearly double the number of Spark tasks hinted in the configuration. Proposing that the {{KafkaOffsetRangeCalculator}}'s range calculation logic be modified to exclude small (i.e. 
un-split) partitions from the overall proportional distribution math, in order to more reasonably divide the large partitions when they are accompanied by many small partitions, and to provide optimal behavior for cases where a {{minPartitions}} value is deliberately computed based on the volume of data being read. was: While the [documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] does contain a clear disclaimer, {quote}Please note that this configuration is like a {{hint}}: the number of Spark tasks will be *approximately* {{minPartitions}}. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data. {quote} there are cases where the calculated Kafka partition range splits can differ greatly from expectations. For evenly distributed data and most {{minPartitions}} values this would not be a major or commonly encountered concern. However when the distribution of data across partitions is very heavily skewed, somewhat surprising range split calculations can result. For example, given the following input data: * 1 partition containing 10,000 messages * 1,000 partitions each containing 1 message Spark processing code loading from this collection of 1,001 partitions may decide that it would like each task to read no more than 1,000 messages. Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting the single large partition to be split into 10 equal chunks, along with the 1,000 small partitions each having their own task. That is far from what actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting the large partition into 918 chunks of 10 or 11 messages, two orders of magnitude from the desired maximum message count per task and nearly double the number of Spark tasks hinted in the configuration. Proposing that range the calculation logic be modified to exclude small (i.e. 
un-split) partitions from the overall proportional distribution math, in order to more reasonably divide the large partitions when they are accompanied by many small partitions, and to provide optimal behavior for cases where a {{minPartitions}} value is deliberately computed based on the volume of data being read. > Improve range split calculation for Kafka Source minPartitions option > - > > Key: SPARK-36576 > URL: https://issues.apache.org/jira/browse/SPARK-36576 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.2 >Reporter: Andrew Olson >Priority: Minor > > While the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] > does contain a clear disclaimer, > {quote}Please note that this configuration is like a {{hint}}: the number of > Spark tasks will be *approximately* {{minPartitions}}. It can be less or more > depending on rounding errors or Kafka partitions
[jira] [Created] (SPARK-36576) Improve range split calculation for Kafka Source minPartitions option
Andrew Olson created SPARK-36576: Summary: Improve range split calculation for Kafka Source minPartitions option Key: SPARK-36576 URL: https://issues.apache.org/jira/browse/SPARK-36576 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.2 Reporter: Andrew Olson While the [documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] does contain a clear disclaimer, {quote}Please note that this configuration is like a {{hint}}: the number of Spark tasks will be *approximately* {{minPartitions}}. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data. {quote} there are cases where the calculated Kafka partition range splits can differ greatly from expectations. For evenly distributed data and most {{minPartitions}} values this would not be a major or commonly encountered concern. However when the distribution of data across partitions is very heavily skewed, somewhat surprising range split calculations can result. For example, given the following input data: * 1 partition containing 10,000 messages * 1,000 partitions each containing 1 message Spark processing code loading from this collection of 1,001 partitions may decide that it would like each task to read no more than 1,000 messages. Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting the single large partition to be split into 10 equal chunks, along with the 1,000 small partitions each having their own task. That is far from what actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting the large partition into 918 chunks of 10 or 11 messages, two orders of magnitude from the desired maximum message count per task and nearly double the number of Spark tasks hinted in the configuration. Proposing that the range calculation logic be modified to exclude small (i.e. 
un-split) partitions from the overall proportional distribution math, in order to more reasonably divide the large partitions when they are accompanied by many small partitions, and to provide optimal behavior for cases where a {{minPartitions}} value is deliberately computed based on the volume of data being read.
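The arithmetic above can be reproduced, and the proposed change illustrated, with a small Python sketch. This is a simplification of the proportional rounding behavior described in the report, not the actual {{KafkaOffsetRangeCalculator}} code, and `proposed_splits` is one possible reading of the proposal:

```python
# Simplified sketch of the proportional split rounding described in the
# report (not the actual KafkaOffsetRangeCalculator code): each partition
# gets a share of minPartitions proportional to its message count,
# rounded, with a minimum of one task.
def current_splits(sizes, min_partitions):
    total = sum(sizes)
    return [max(round(s / total * min_partitions), 1) for s in sizes]

# One possible reading of the proposal: partitions too small to split
# keep a single task and are excluded from the proportional math, so the
# remaining budget divides only the genuinely large partitions.
def proposed_splits(sizes, min_partitions):
    total = sum(sizes)
    small_count = sum(1 for s in sizes if round(s / total * min_partitions) <= 1)
    large_total = sum(s for s in sizes if round(s / total * min_partitions) > 1)
    budget = max(min_partitions - small_count, 1)
    return [
        1 if round(s / total * min_partitions) <= 1
        else max(round(s / large_total * budget), 1)
        for s in sizes
    ]

# the skewed example from the description: one partition of 10,000
# messages plus 1,000 partitions of 1 message each
sizes = [10_000] + [1] * 1_000

cur = current_splits(sizes, 1_010)
# cur[0] is 918 chunks for the large partition; sum(cur) is 1,918 tasks

prop = proposed_splits(sizes, 1_010)
# prop[0] is 10 chunks of 1,000 messages; sum(prop) is the hinted 1,010
```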
[jira] [Commented] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"
[ https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403341#comment-17403341 ] Andrew Grigorev commented on SPARK-29223: - Just an idea - couldn't this be implemented in "timestamp field filter pushdown"-like way? > Kafka source: offset by timestamp - allow specifying timestamp for "all > partitions" > --- > > Key: SPARK-29223 > URL: https://issues.apache.org/jira/browse/SPARK-29223 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.2.0 > > > This issue is a follow-up of SPARK-26848. > In SPARK-26848, we decided to open possibility to let end users set > individual timestamp per partition. But in many cases, specifying timestamp > represents the intention that we would want to go back to specific timestamp > and reprocess records, which should be applied to all topics and partitions. > According to the format of > `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not > intuitive to provide an option to set a global timestamp across topic, it's > still intuitive to provide an option to set a global timestamp across > partitions in a topic. > This issue tracks the efforts to deal with this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
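For reference, the per-partition option format that motivates this request: `startingOffsetsByTimestamp` takes a JSON map of topic to partition to timestamp, so rewinding a whole topic today means enumerating every partition yourself. The topic name and partition count below are made up for illustration:

```python
import json

# Rewinding "all partitions" with the current API: the
# startingOffsetsByTimestamp option demands a timestamp per
# topic-partition, so the caller must enumerate partitions.
# Topic name and partition count here are illustrative.
rewind_to = 1_600_000_000_000  # epoch milliseconds
num_partitions = 3
option_value = json.dumps(
    {"events": {str(p): rewind_to for p in range(num_partitions)}}
)
# option_value is the string you would pass as the
# "startingOffsetsByTimestamp" option on the Kafka source
```

A global per-topic timestamp, as proposed in this issue, would remove the need to know the partition count up front.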
[jira] [Created] (SPARK-35919) Support pathlib.PurePath-like objects in DataFrameReader / DataFrameWriter
Andrew Grigorev created SPARK-35919: --- Summary: Support pathlib.PurePath-like objects in DataFrameReader / DataFrameWriter Key: SPARK-35919 URL: https://issues.apache.org/jira/browse/SPARK-35919 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.2, 2.4.8 Reporter: Andrew Grigorev It would be nice to support Path objects in `spark.\{read,write}.\{parquet,orc,csv,...etc}` methods. Without pyspark source code changes it currently seems possible only by the ugly monkeypatching hacks - https://stackoverflow.com/q/68170685/2649222.
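Until such support lands, a thin helper can coerce path-like objects before they reach the reader. This is an illustrative workaround built on the standard `os.fspath` protocol, not a pyspark API:

```python
import os
from pathlib import PurePosixPath

# Illustrative workaround (not a pyspark API): coerce PurePath-like
# objects to plain strings via the os.PathLike protocol before handing
# them to spark.read / spark.write methods, which expect str.
def to_spark_path(path):
    return os.fspath(path) if isinstance(path, os.PathLike) else path

p = to_spark_path(PurePosixPath("/data") / "events.parquet")
# → "/data/events.parquet", safe to pass to spark.read.parquet(...)
```

Native support would amount to pyspark applying the same coercion internally, which is what makes the monkeypatching in the linked Stack Overflow question unnecessary.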
[jira] [Commented] (SPARK-34723) Correct parameter type for subexpression elimination under whole-stage
[ https://issues.apache.org/jira/browse/SPARK-34723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368894#comment-17368894 ] Andrew Olson commented on SPARK-34723: -- In case it might be helpful to anyone, the compilation failure's stack trace looks something like this. {noformat} [main] ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, Column 84: IDENTIFIER expected instead of '[' org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, Column 84: IDENTIFIER expected instead of '[' at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:196) at org.codehaus.janino.Parser.read(Parser.java:3705) at org.codehaus.janino.Parser.parseQualifiedIdentifier(Parser.java:446) at org.codehaus.janino.Parser.parseReferenceType(Parser.java:2569) at org.codehaus.janino.Parser.parseType(Parser.java:2549) at org.codehaus.janino.Parser.parseFormalParameter(Parser.java:1688) at org.codehaus.janino.Parser.parseFormalParameterList(Parser.java:1639) at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1620) at org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1403) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1500) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351) at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:92) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at 
org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:112) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:184) at
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991752#comment-16991752 ] Andrew White commented on SPARK-22876: -- This issue is most certainly unresolved, and the documentation is misleading at best. [~choojoyq] - perhaps you could share your code on how you deal with this in general? The general solution seems non-obvious to me given all of the intricacies of Spark + Yarn interactions such as shutdown hooks, handling signals, client vs cluster mode. Maybe I'm over thinking the solution. This feature is critical for supporting long-running applications. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > Labels: bulk-closed > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. 
> * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
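The check described in the quoted report can be restated as a one-line predicate. This is a pure-Python paraphrase of the cited {{ApplicationMaster.scala}} shutdown hook, not the actual code:

```python
# Paraphrase of the shutdown-hook check cited in the report (not the
# actual ApplicationMaster.scala code): if the attempt id has reached
# maxAppAttempts, the application unregisters and finishes. Note that
# attemptFailuresValidityInterval plays no part in this decision, which
# is the behavior being reported as a bug.
def shutdown_hook_should_unregister(attempt_id, max_app_attempts):
    return attempt_id >= max_app_attempts

# the 20th attempt unregisters even if the 1s validity interval
# should have expired all earlier failures
assert shutdown_hook_should_unregister(20, 20)
assert not shutdown_hook_should_unregister(19, 20)
```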
[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903377#comment-16903377 ] Andrew Leverentz commented on SPARK-28085: -- In Chrome 76, this issue appears to be resolved. Thanks to anyone out there who submitted bug reports :) > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890314#comment-16890314 ] Andrew Leverentz commented on SPARK-28085: -- This issue still remains, more than a month after the Chrome update that caused it. It's not clear whether Google considers it a bug that needs fixing. I've reported the issue to Google, as mentioned above, but if anyone else has a better way of contacting the Chrome team, I'd appreciate it if you could try to get in touch with them to see whether they are aware of this bug and planning to fix it. > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". 
[jira] [Resolved] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Leverentz resolved SPARK-28225. -- Resolution: Not A Problem > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. > *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > .withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > 
true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. > *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890310#comment-16890310 ] Andrew Leverentz commented on SPARK-28225: -- Marco, thanks for the explanation. In this case, the workaround in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. > *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > 
.withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. > *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
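Both the nulls and the workaround can be demonstrated without Spark: an ordered window defaults to the frame (unboundedPreceding, currentRow), so first(..., ignoreNulls) only sees rows up to and including the current one. A pure-Python simulation of the u2 column from the report:

```python
# Pure-Python simulation (no Spark required) of the u2 column:
# first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc)).
def first_ignore_nulls(values):
    return next((v for v in values if v is not None), None)

rows = [(101, None), (102, None), (103, None), (201, 26), (202, 25), (203, 24)]
ordered = [y for _, y in sorted(rows)]  # Window.orderBy($"x".asc)

# Default frame for an ordered window: unboundedPreceding .. currentRow,
# so each row only sees the prefix ending at itself -- hence the nulls
# on the rows that precede the first non-null y.
u2_default = [first_ignore_nulls(ordered[: i + 1]) for i in range(len(ordered))]
# → [None, None, None, 26, 26, 26]

# The workaround from the comment above: an explicit full frame,
# rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing),
# under which every row sees the whole partition.
u2_unbounded = [first_ignore_nulls(ordered) for _ in ordered]
# → [26, 26, 26, 26, 26, 26]
```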
[jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890310#comment-16890310 ] Andrew Leverentz edited comment on SPARK-28225 at 7/22/19 4:54 PM: --- Marco, thanks for the explanation. In this case, the solution in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. was (Author: alev_etx): Marco, thanks for the explanation. In this case, the workaround in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. 
> *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > .withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28225) Unexpected behavior for Window functions
Andrew Leverentz created SPARK-28225: Summary: Unexpected behavior for Window functions Key: SPARK-28225 URL: https://issues.apache.org/jira/browse/SPARK-28225 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Andrew Leverentz I've noticed some odd behavior when combining the "first" aggregate function with an ordered Window. In particular, I'm working with columns created using the syntax {code} first($"y", ignoreNulls = true).over(Window.orderBy($"x")) {code} Below, I'm including some code which reproduces this issue in a Databricks notebook. *Code:* {code:java} import org.apache.spark.sql.functions.first import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType,StructField,IntegerType} val schema = StructType(Seq( StructField("x", IntegerType, false), StructField("y", IntegerType, true), StructField("z", IntegerType, true) )) val input = spark.createDataFrame(sc.parallelize(Seq( Row(101, null, 11), Row(102, null, 12), Row(103, null, 13), Row(203, 24, null), Row(201, 26, null), Row(202, 25, null) )), schema = schema) input.show val output = input .withColumn("u1", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last))) .withColumn("u2", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc))) .withColumn("u3", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last))) .withColumn("u4", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc))) .withColumn("u5", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last))) .withColumn("u6", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc))) .withColumn("u7", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last))) .withColumn("u8", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc))) output.show {code} *Expectation:* Based on my understanding of how ordered-Window and aggregate functions work, the results I 
expected to see were: * u1 = u2 = constant value of 26 * u3 = u4 = constant value of 24 * u5 = u6 = constant value of 11 * u7 = u8 = constant value of 13 However, columns u1, u2, u7, and u8 contain some unexpected nulls. *Results:* {code:java} +---+++++---+---+---+---+++ | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| +---+++++---+---+---+---+++ |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| +---+++++---+---+---+---+++ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Leverentz updated SPARK-28085: - Description: In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get redirected to: [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". was: In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get : [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. 
I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
Andrew Leverentz created SPARK-28085: Summary: Spark Scala API documentation URLs not working properly in Chrome Key: SPARK-28085 URL: https://issues.apache.org/jira/browse/SPARK-28085 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.3 Reporter: Andrew Leverentz In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get : [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28062) HuberAggregator copies coefficients vector every time an instance is added
Andrew Crosby created SPARK-28062: - Summary: HuberAggregator copies coefficients vector every time an instance is added Key: SPARK-28062 URL: https://issues.apache.org/jira/browse/SPARK-28062 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.0.0 Reporter: Andrew Crosby Every time an instance is added to the HuberAggregator, a copy of the coefficients vector is created (see code snippet below). This causes a performance degradation, which is particularly severe when the instances have long sparse feature vectors. {code:scala} def add(instance: Instance): HuberAggregator = { instance match { case Instance(label, weight, features) => require(numFeatures == features.size, s"Dimensions mismatch when adding new sample." + s" Expecting $numFeatures but got ${features.size}.") require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0") if (weight == 0.0) return this val localFeaturesStd = bcFeaturesStd.value val localCoefficients = bcParameters.value.toArray.slice(0, numFeatures) val localGradientSumArray = gradientSumArray // Snip } {code} The LeastSquaresAggregator class avoids this performance issue via the use of transient lazy class variables to store such reused values. Applying a similar approach to HuberAggregator gives a significant speed boost. 
Running the script below locally on my machine gives the following timing results: {noformat} Current implementation: Time(s): 540.1439919471741 Iterations: 26 Intercept: 0.518109382890512 Coefficients: [0.0, -0.2516936902000245, 0.0, 0.0, -0.19633887469839809, 0.0, -0.39565545053893925, 0.0, -0.18617574426698882, 0.0478922416670529] Modified implementation to match LeastSquaresAggregator: Time(s): 46.82946586608887 Iterations: 26 Intercept: 0.5181093828893774 Coefficients: [0.0, -0.25169369020031357, 0.0, 0.0, -0.1963388746927919, 0.0, -0.3956554505389966, 0.0, -0.18617574426702874, 0.04789224166878518] {noformat} {code:python} from random import random, randint, seed import time from pyspark.ml.feature import OneHotEncoder from pyspark.ml.regression import LinearRegression from pyspark.sql import SparkSession seed(0) spark = SparkSession.builder.appName('huber-speed-test').getOrCreate() df = spark.createDataFrame([[randint(0, 10), random()] for i in range(10)], ["category", "target"]) ohe = OneHotEncoder(inputCols=["category"], outputCols=["encoded_category"]).fit(df) lr = LinearRegression(featuresCol="encoded_category", labelCol="target", loss="huber", regParam=1.0) start = time.time() model = lr.fit(ohe.transform(df)) end = time.time() print("Time(s): " + str(end - start)) print("Iterations: " + str(model.summary.totalIterations)) print("Intercept: " + str(model.intercept)) print("Coefficients: " + str(list(model.coefficients)[0:10])) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
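The shape of the reported fix can be illustrated without Spark. In this sketch (plain Python, hypothetical class names), the first aggregator re-materializes a dense local copy of the coefficients on every {{add()}}, as HuberAggregator is described to do, while the second materializes it once and reuses it, mirroring how LeastSquaresAggregator's transient lazy vals are described to work:

```python
# "CopyPerAdd" pays an O(numFeatures) array copy per instance added;
# "CachedLocal" pays it once per aggregator, matching the proposed fix.
class CopyPerAddAggregator:
    def __init__(self, coefficients):
        self._coefficients = coefficients
        self.copies_made = 0

    def add(self, instance):
        local = list(self._coefficients)  # fresh copy for every instance
        self.copies_made += 1
        return self

class CachedLocalAggregator:
    def __init__(self, coefficients):
        self._coefficients = coefficients
        self._local = None
        self.copies_made = 0

    def add(self, instance):
        if self._local is None:           # copy once, reuse for every instance
            self._local = list(self._coefficients)
            self.copies_made += 1
        return self

coeffs = [0.0] * 10_000
a, b = CopyPerAddAggregator(coeffs), CachedLocalAggregator(coeffs)
for row in range(1_000):
    a.add(row)
    b.add(row)
print(a.copies_made, b.copies_made)  # -> 1000 1
```

With long sparse feature vectors the per-add copy dominates the useful work, which is consistent with the roughly 10x speedup reported above.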
[jira] [Created] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded
Andrew McHarg created SPARK-27560: - Summary: HashPartitioner uses Object.hashCode which is not seeded Key: SPARK-27560 URL: https://issues.apache.org/jira/browse/SPARK-27560 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.4.0 Environment: Notebook is running spark v2.4.0 local[*] Python 3.6.6 (default, Sep 6 2018, 13:10:03) [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin I imagine this would reproduce on all operating systems and most versions of spark though. Reporter: Andrew McHarg Forgive the quality of the bug report here, I am a pyspark user and not super familiar with the internals of spark, yet it seems I have a strange corner case with the HashPartitioner. This may already be known but repartition with HashPartitioner seems to assign everything to the same partition if data that was partitioned by the same column is only partially read (say one partition). I suppose it is an obvious consequence of Object.hashCode being deterministic but it took a while to track down. Steps to repro: # Get dataframe with a bunch of uuids say 1 # repartition(100, 'uuid_column') # save to parquet # read from parquet # collect()[:100] then filter using pyspark.sql.functions isin (yes I know this is bad and sampleBy should probably be used here) # repartition(10, 'uuid_column') # Resulting dataframe will have all of its data in one single partition Jupyter notebook for the above: https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e I think an easy fix would be to seed the HashPartitioner like many hashtable libraries do to avoid denial of service attacks. It also might be the case this is obvious behavior for more experienced spark users :) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
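The collapse follows from modular arithmetic, assuming HashPartitioner places a row by something like key.hashCode mod numPartitions (Python's hash() of a string stands in for Java's hashCode in this sketch): if every surviving key hashed into one partition of 100, and 10 divides 100, then every surviving key also hashes into the same one of 10 new partitions.

```python
import random
import uuid

# Assumption being illustrated: the partitioner is (roughly) hash mod N.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions  # Python's % is already non-negative

random.seed(0)
keys = [str(uuid.UUID(int=random.getrandbits(128))) for _ in range(10_000)]

# Simulate reading back just one of the 100 saved partitions...
survivors = [k for k in keys if partition_for(k, 100) == 7]

# ...then repartitioning by the same column into 10 partitions.
new_partitions = {partition_for(k, 10) for k in survivors}

# h % 100 == 7 implies h % 10 == 7 (because 10 divides 100), so every
# surviving key lands in the same new partition.
print(new_partitions)  # -> {7}
```

Note the effect depends on the new partition count dividing the old one: repartitioning 100 partitions into, say, 7 would not collapse this way, which may explain why the behavior only surfaces occasionally.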
[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822548#comment-16822548 ] Andrew Crosby commented on SPARK-26970: --- The code changes required for this looked relatively straightforward so I've had a go at creating a pull request myself (https://github.com/apache/spark/pull/24426) [~huaxingao] apologies if I've duplicated work that you've already done. > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Minor > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27216) Kryo serialization with RoaringBitmap
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804447#comment-16804447 ] Andrew Allen commented on SPARK-27216: -- Thanks for this bug report, [~cltlfcjin]. We were running into "FetchFailedException: Received a zero-size buffer for block shuffle_0_0_332 from BlockManagerId" and this bug report gave us enough of a hint to try {{config.set("spark.kryo.unsafe", "false")}}, which worked around the issue. > Kryo serialization with RoaringBitmap > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer. > We can use below UT to reproduce: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-26970: -- Description: The Interaction transformer [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] is missing from the set of pyspark feature transformers [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} was: The Interaction transformer [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] is missing from the set of pyspark feature transformers [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] This means that it is impossible to create a model that includes an Interaction transformer with pyspark. 
It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Major > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-26970: -- Description: The Interaction transformer [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] is missing from the set of pyspark feature transformers [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} was: The Interaction transformer ( [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] ) is missing from the set of pyspark feature transformers ( [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] ). This means that it is impossible to create a model that includes an Interaction transformer with pyspark. 
It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Major > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-26970: -- Description: The Interaction transformer ( [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] ) is missing from the set of pyspark feature transformers ( [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] ). This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} was: The Interaction transformer ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] is missing from the set of pyspark feature transformers ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]. This means that it is impossible to create a model that includes an Interaction transformer with pyspark. 
It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Major > > The Interaction transformer ( > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] > ) is missing from the set of pyspark feature transformers ( > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] > ). > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
Andrew Crosby created SPARK-26970: - Summary: Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer Key: SPARK-26970 URL: https://issues.apache.org/jira/browse/SPARK-26970 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Andrew Crosby The Interaction transformer ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] is missing from the set of pyspark feature transformers ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]. This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images
[ https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338 ] Andrew Korzhuev commented on SPARK-22865: - I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them from [https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also updated. I would love to help making this happen for Spark, but I need somebody to show me around Apache CI/CD infrastructure. > Publish Official Apache Spark Docker images > --- > > Key: SPARK-22865 > URL: https://issues.apache.org/jira/browse/SPARK-22865 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images
[ https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338 ] Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 11:00 PM: I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them from [https://hub.docker.com/r/andrusha/spark-k8s/tags/] scripts were also updated. I would love to help making this happen for Spark, but I need somebody to show me around Apache CI/CD infrastructure. was (Author: akorzhuev): I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them from [https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also updated. I would love to help making this happen for Spark, but I need somebody to show me around Apache CI/CD infrastructure. > Publish Official Apache Spark Docker images > --- > > Key: SPARK-22865 > URL: https://issues.apache.org/jira/browse/SPARK-22865 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images
[ https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338 ] Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 10:59 PM: I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them from [https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also updated. I would love to help making this happen for Spark, but I need somebody to show me around Apache CI/CD infrastructure. was (Author: akorzhuev): I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them from [https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also updated. I would love to help making this happen for Spark, but I need somebody to show me around Apache CI/CD infrastructure. > Publish Official Apache Spark Docker images > --- > > Key: SPARK-22865 > URL: https://issues.apache.org/jira/browse/SPARK-22865 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691787#comment-16691787 ] Andrew Otto commented on SPARK-14492: - I found my issue: we were loading some Hive 1.1.0 jars manually using spark.driver.extraClassPath in order to open a Hive JDBC connection directly to Hive, instead of using spark.sql(), to work around https://issues.apache.org/jira/browse/SPARK-23890. The Hive 1.1.0 classes were loaded before the ones included with Spark, and as such they failed when referencing a Hive configuration property that didn't exist in 1.1.0. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
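The failure mode in the stack traces above can be reproduced in miniature. The sketch below uses invented names (it is not Spark's or Hive's actual code): a configuration field referenced statically and compiled against a newer library version blows up with java.lang.NoSuchFieldError the moment the older jars win the classpath race, whereas probing for the field by name via reflection lets the caller detect the version mismatch and fall back gracefully.

```java
import java.lang.reflect.Field;

public class ConfFieldProbe {
    // Illustrative stand-in for a library config class: it has one field
    // that exists in every version, while METASTORE_CLIENT_SOCKET_LIFETIME
    // (added in Hive 1.2.0) would be absent from an older release like this.
    public static class OldConfVars {
        public static final String METASTORE_URIS = "hive.metastore.uris";
    }

    // Probe for a public static field by name instead of referencing it
    // statically; returns false for fields the loaded version lacks.
    public static boolean hasStaticField(Class<?> cls, String name) {
        try {
            Field f = cls.getField(name);
            return f != null;
        } catch (NoSuchFieldException e) {
            return false; // older library version on the classpath
        }
    }
}
```

A static reference to the missing field would instead fail at class-initialization time with the NoSuchFieldError seen above, which is why the error surfaces so early in HiveContext/HiveUtils.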
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685375#comment-16685375 ] Andrew Otto commented on SPARK-14492: - FWIW, I am experiencing the same problem with Spark 2.3.1 and Hive 1.1.0 (from CDH 5.15.0). I've tried setting both spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars (although I'm not sure I've got the right classpath for that one), and am still experiencing this problem. {code:java} 18/11/13 14:31:12 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:195) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79) at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at org.apache.spark.sql.Dataset.(Dataset.scala:172) at org.apache.spark.sql.Dataset.(Dataset.scala:178) at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:488){code} > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at
[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634268#comment-16634268 ] Andrew Crosby commented on SPARK-25544: --- SPARK-23537 contains what might be another occurrence of this issue. The model in that case contains only binary features, so standardization shouldn't really be used. However, turning standardization off causes the model to take 4992 iterations to converge as opposed to 37 iterations when standardization is turned on. > Slow/failed convergence in Spark ML models due to internal predictor scaling > > > Key: SPARK-25544 > URL: https://issues.apache.org/jira/browse/SPARK-25544 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.2 > Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11 >Reporter: Andrew Crosby >Priority: Major > > The LinearRegression and LogisticRegression estimators in Spark ML can take a > large number of iterations to converge, or fail to converge altogether, when > trained using the l-bfgs method with standardization turned off. > *Details:* > LinearRegression and LogisticRegression standardize their input features by > default. In SPARK-8522 the option to disable standardization was added. This > is implemented internally by changing the effective strength of > regularization rather than disabling the feature scaling. Mathematically, > both changing the effective regularizaiton strength, and disabling feature > scaling should give the same solution, but they can have very different > convergence properties. > The normal justication given for scaling features is that it ensures that all > covariances are O(1) and should improve numerical convergence, but this > argument does not account for the regularization term. This doesn't cause any > issues if standardization is set to true, since all features will have an > O(1) regularization strength. 
But it does cause issues when standardization > is set to false, since the effecive regularization strength of feature i is > now O(1/ sigma_i^2) where sigma_i is the standard deviation of the feature. > This means that predictors with small standard deviations (which can occur > legitimately e.g. via one hot encoding) will have very large effective > regularization strengths and consequently lead to very large gradients and > thus poor convergence in the solver. > *Example code to recreate:* > To demonstrate just how bad these convergence issues can be, here is a very > simple test case which builds a linear regression model with a categorical > feature, a numerical feature and their interaction. When fed the specified > training data, this model will fail to converge before it hits the maximum > iteration limit. In this case, it is the interaction between category "2" and > the numeric feature that leads to a feature with a small standard deviation. > Training data: > ||category||numericFeature||label|| > |1|1.0|0.5| > |1|0.5|1.0| > |2|0.01|2.0| > > {code:java} > val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, > 2.0)).toDF("category", "numericFeature", "label") > val indexer = new StringIndexer().setInputCol("category") > .setOutputCol("categoryIndex") > val encoder = new > OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) > val interaction = new Interaction().setInputCols(Array("categoryEncoded", > "numericFeature")).setOutputCol("interaction") > val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", > "interaction")).setOutputCol("features") > val model = new > LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) > val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, > assembler, model)) > val pipelineModel = pipeline.fit(df) > val 
numIterations = > pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} > *Possible fix:* > These convergence issues can be fixed by turning off feature scaling when > standardization is set to false rather than using an effective regularization > strength. This can be hacked into LinearRegression.scala by simply replacing > line 423 > {code:java} > val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) > {code} > with > {code:java} > val featuresStd = if ($(standardization)) > featuresSummarizer.variance.toArray.map(math.sqrt) else > featuresSummarizer.variance.toArray.map(x => 1.0) > {code} > Rerunning the above test code with that hack in place, will lead to > convergence after just 4 iterations instead of hitting the max iterations > limit! > *Impact:* > I
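The O(1/sigma_i^2) argument above can be made concrete with a small arithmetic sketch (illustrative only, not Spark code): when features are scaled internally but standardization is off, the effective L2 strength on feature i becomes regParam divided by the square of that feature's standard deviation, so a feature with values around 0.01, like the interaction term in the example, picks up a penalty roughly ten thousand times stronger than an O(1) feature.

```java
public class EffectiveReg {
    // Effective regularization strength of a feature with standard
    // deviation sigma when the penalty is applied in the scaled space:
    // regParam / sigma^2, per the description above.
    public static double effectiveStrength(double regParam, double sigma) {
        return regParam / (sigma * sigma);
    }
}
```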
[jira] [Commented] (SPARK-23537) Logistic Regression without standardization
[ https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634263#comment-16634263 ] Andrew Crosby commented on SPARK-23537: --- The different results for standardization=True vs standardization=False are to be expected. The reason for this difference is that the two settings lead to different effective regularization strengths. With standardization=True, the regularization is applied to the scaled model coefficients, whereas with standardization=False it is applied to the unscaled model coefficients. As it's implemented in Spark, the features actually get scaled regardless of whether standardization is set to true or false, but when standardization=False the strength of the regularization in the scaled space is altered to account for this. See the comment at [https://github.com/apache/spark/blob/a802c69b130b69a35b372ffe1b01289577f6fafb/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L685.] As an aside, your results show a very slow rate of convergence when standardization is set to false. I believe this to be an issue caused by the continued application of feature scaling when standardization=False, which can lead to very large gradients from the regularization terms in the solver. I've recently raised SPARK-25544 to cover this issue. > Logistic Regression without standardization > --- > > Key: SPARK-23537 > URL: https://issues.apache.org/jira/browse/SPARK-23537 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.2, 2.2.1 >Reporter: Jordi >Priority: Major > Attachments: non-standardization.log, standardization.log > > > I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer > to not use standardization since all my features are binary, using the > hashing trick (2^20 sparse vector). 
> I trained two models to compare results, I've been expecting to end with two > similar models since it seems that internally the optimizer performs > standardization and "de-standardization" (when it's deactivated) in order to > improve the convergence. > Here you have the code I used: > {code:java} > val lr = new org.apache.spark.ml.classification.LogisticRegression() > .setRegParam(0.05) > .setElasticNetParam(0.0) > .setFitIntercept(true) > .setMaxIter(5000) > .setStandardization(false) > val model = lr.fit(data) > {code} > The results are disturbing me, I end with two significantly different models. > *Standardization:* > Training time: 8min. > Iterations: 37 > Intercept: -4.386090107224499 > Max weight: 4.724752299455218 > Min weight: -3.560570478164854 > Mean weight: -0.049325201841722795 > l1 norm: 116710.39522171849 > l2 norm: 402.2581552373957 > Non zero weights: 128084 > Non zero ratio: 0.12215042114257812 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) > 0.000559057 > 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) > 0.000267527 > 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) > 0.000205888 > 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) > 0.000144173 > 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) > 0.000140296 > 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) > 0.000122709 > 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) > 3.08789e-05 > 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) > 2.23806e-05 > 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) > 1.47422e-05 > 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) > 2.37442e-05 > {code} > *No standardization:* > Training time: 7h 14 min. 
> Iterations: 4992 > Intercept: -4.216690468849263 > Max weight: 0.41930559767624725 > Min weight: -0.5949182537565524 > Mean weight: -1.2659769019012E-6 > l1 norm: 14.262025330648694 > l2 norm: 1.2508777025612263 > Non zero weights: 128955 > Non zero ratio: 0.12298107147216797 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) > 0.217581 > 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) > 0.185812 > 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) > 0.214570 > 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) > 0.489464 > 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) > 0.178448 > 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) > 0.172527 > 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm:
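The "standardization and de-standardization" round trip mentioned in this report can be sketched as follows (illustrative only, not the Spark implementation): a model fitted on features divided by their standard deviations is mapped back to the original feature scale by dividing each coefficient by the corresponding standard deviation, which is why the two settings are expected to agree when there is no regularization.

```java
public class Destandardize {
    // Map coefficients fitted in the scaled feature space back to the
    // original scale: beta_i = beta_scaled_i / sigma_i.
    public static double[] rescale(double[] scaledCoefficients, double[] featureStd) {
        double[] original = new double[scaledCoefficients.length];
        for (int i = 0; i < original.length; i++) {
            original[i] = scaledCoefficients[i] / featureStd[i];
        }
        return original;
    }
}
```

With regularization in play, the penalty is applied either before or after this mapping depending on the standardization flag, which is the source of the differing models above.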
[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-25544: -- Description: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The normal justification given for scaling features is that it ensures that all covariances are O(1) and should improve numerical convergence, but this argument does not account for the regularization term. This doesn't cause any issues if standardization is set to true, since all features will have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations (which can occur legitimately e.g. via one hot encoding) will have very large effective regularization strengths and consequently lead to very large gradients and thus poor convergence in the solver. *Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. 
In this case, it is the interaction between category "2" and the numeric feature that leads to a feature with a small standard deviation. Training data: ||category||numericFeature||label|| |1|1.0|0.5| |1|0.5|1.0| |2|0.01|2.0| {code:java} val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label") val indexer = new StringIndexer().setInputCol("category") .setOutputCol("categoryIndex") val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction") val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features") val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model)) val pipelineModel = pipeline.fit(df) val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} *Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423 {code:java} val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) {code} with {code:java} val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0) {code} Rerunning the above test code with that hack in place, will lead to convergence after just 4 iterations instead of hitting the max iterations limit! 
*Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on. was: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling.
[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-25544: -- Description: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The normal justification given for scaling features is that it ensures that all covariances are O(1) and should improve numerical convergence, but this argument does not account for the regularization term. This doesn't cause any issues if standardization is set to true, since all features will have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations will have very large effective regularization strengths and consequently lead to very large gradients and thus poor convergence in the solver. *Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. 
In this case, it is the interaction between category "2" and the numeric feature that leads to a feature with a small standard deviation. Training data: ||category||numericFeature||label|| |1|1.0|0.5| |1|0.5|1.0| |2|0.01|2.0| {code:java} val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label") val indexer = new StringIndexer().setInputCol("category") .setOutputCol("categoryIndex") val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction") val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features") val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model)) val pipelineModel = pipeline.fit(df) val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} *Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423 {code:java} val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) {code} with {code:java} val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0) {code} Rerunning the above test code with that hack in place, will lead to convergence after just 4 iterations instead of hitting the max iterations limit! 
*Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on. was: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularizaiton
[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-25544: -- Description: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The normal justification given for scaling features is that it ensures that all covariances are O(1) and should improve numerical convergence, but this argument does not account for the regularization term. This doesn't cause any issues if standardization is set to true, since all features will have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations will have very large effective regularization strengths and consequently lead to very large gradients and thus poor convergence in the solver. *Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. 
Training data: ||category||numericFeature||label|| |1|1.0|0.5| |1|0.5|1.0| |2|0.01|2.0| {code:java} val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label") val indexer = new StringIndexer().setInputCol("category") .setOutputCol("categoryIndex") val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction") val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features") val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model)) val pipelineModel = pipeline.fit(df) val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} *Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423 {code:java} val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) {code} with {code:java} val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0) {code} Rerunning the above test code with that hack in place, will lead to convergence after just 4 iterations instead of hitting the max iterations limit! 
*Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on. was: The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularizaiton strength, and disabling feature scaling should give the same solution, but they can have very different convergence properties. The normal
[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
Andrew Crosby created SPARK-25544: - Summary: Slow/failed convergence in Spark ML models due to internal predictor scaling Key: SPARK-25544 URL: https://issues.apache.org/jira/browse/SPARK-25544 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.2 Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11 Reporter: Andrew Crosby The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The normal justification given for scaling features is that it ensures that all covariances are O(1) and should improve numerical convergence, but this argument does not account for the regularization term. This doesn't cause any issues if standardization is set to true, since all features will have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations will have very large effective regularization strengths and consequently lead to very large gradients and thus poor convergence in the solver. 
*Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. Training data: ||category||numericFeature||label|| |1|1.0|0.5| |1|0.5|1.0| |2|0.01|2.0| {code:java} val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label") val indexer = new StringIndexer().setInputCol("category") .setOutputCol("categoryIndex") val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction") val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features") val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model)) val pipelineModel = pipeline.fit(df) val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} *Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false rather than using an effective regularization strength. 
This can be hacked into LinearRegression.scala by simply replacing line 423 {code:java} val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) {code} with {code:java} val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0) {code} Rerunning the above test code with that hack in place will lead to convergence after just 4 iterations instead of hitting the max iterations limit! *Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark
[ https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541265#comment-16541265 ] Andrew Ash commented on SPARK-21962: Note that HTrace is now being removed from Hadoop – HADOOP-15566 > Distributed Tracing in Spark > > > Key: SPARK-21962 > URL: https://issues.apache.org/jira/browse/SPARK-21962 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > > Spark should support distributed tracing, which is the mechanism, widely > popularized by Google in the [Dapper > Paper|https://research.google.com/pubs/pub36356.html], where network requests > have additional metadata used for tracing requests between services. > This would be useful for me since I have OpenZipkin style tracing in my > distributed application up to the Spark driver, and from the executors out to > my other services, but the link is broken in Spark between driver and > executor since the Span IDs aren't propagated across that link. > An initial implementation could instrument the most important network calls > with trace ids (like launching and finishing tasks), and incrementally add > more tracing to other calls (torrent block distribution, external shuffle > service, etc) as the feature matures. > Search keywords: Dapper, Brave, OpenZipkin, HTrace -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24422) Add JDK9+ in our Jenkins' build servers
[ https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531241#comment-16531241 ] Andrew Korzhuev commented on SPARK-24422: - Also `.travis.yml` needs to be fixed in the following way: {code:java} # 2. Choose language and target JDKs for parallel builds. language: java jdk: - openjdk8 - openjdk9 - openjdk10 {code} > Add JDK9+ in our Jenkins' build servers > --- > > Key: SPARK-24422 > URL: https://issues.apache.org/jira/browse/SPARK-24422 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major >
[jira] [Comment Edited] (SPARK-24422) Add JDK9+ in our Jenkins' build servers
[ https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531241#comment-16531241 ] Andrew Korzhuev edited comment on SPARK-24422 at 7/3/18 11:46 AM: -- Also `.travis.yml` needs to be fixed in the following way: {code:java} # 2. Choose language and target JDKs for parallel builds. language: java jdk: - openjdk8 - openjdk9 {code} was (Author: akorzhuev): Also `.travis.yml` needs to be fixed in the following way: {code:java} # 2. Choose language and target JDKs for parallel builds. language: java jdk: - openjdk8 - openjdk9 - openjdk10 {code} > Add JDK9+ in our Jenkins' build servers > --- > > Key: SPARK-24422 > URL: https://issues.apache.org/jira/browse/SPARK-24422 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24421) sun.misc.Unsafe in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531235#comment-16531235 ] Andrew Korzhuev edited comment on SPARK-24421 at 7/3/18 11:44 AM: -- If I understand this correctly, then the only deprecated JDK9+ API Spark is using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in `[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`, which is fixable in the following way: {code:java} @@ -22,7 +22,7 @@ import java.lang.reflect.Method; import java.nio.ByteBuffer; -import sun.misc.Cleaner; +import java.lang.ref.Cleaner; import sun.misc.Unsafe; public final class Platform { @@ -169,7 +169,8 @@ public static ByteBuffer allocateDirectBuffer(int size) { cleanerField.setAccessible(true); long memory = allocateMemory(size); ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size); - Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); + Cleaner cleaner = Cleaner.create(); + cleaner.register(buffer, () -> freeMemory(memory)); cleanerField.set(buffer, cleaner); return buffer; {code} [https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed] was (Author: akorzhuev): If I understand this correctly, then the only deprecated JDK9+ API Spark is using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in `[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`, which is fixable in the following way: {code:java} @@ -22,7 +22,7 @@ import java.lang.reflect.Method; import java.nio.ByteBuffer; -import sun.misc.Cleaner; +import java.lang.ref.Cleaner; import sun.misc.Unsafe; public final class Platform { @@ -169,7 +169,8 @@ 
public static ByteBuffer allocateDirectBuffer(int size) { cleanerField.setAccessible(true); long memory = allocateMemory(size); ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size); - Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); + Cleaner cleaner = Cleaner.create(); + cleaner.register(buffer, () -> freeMemory(memory)); cleanerField.set(buffer, cleaner); return buffer; {code} [https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed] > sun.misc.Unsafe in JDK9+ > > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
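The migration in the diff above replaces the internal sun.misc.Cleaner, whose factory attaches a cleanup action directly to an object, with java.lang.ref.Cleaner, where a Cleaner instance is created once and actions are registered per object. A minimal standalone sketch of the new API (JDK 9+); the resource object here is a stand-in, not Spark's direct-ByteBuffer plumbing:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanerDemo {
    // One Cleaner instance, typically shared; each register() call attaches
    // a cleanup action to one object.
    static final Cleaner CLEANER = Cleaner.create();

    public static void main(String[] args) {
        AtomicBoolean freed = new AtomicBoolean(false);
        Object resource = new Object(); // stand-in for a direct ByteBuffer
        // The action must not capture `resource` itself, or the object would
        // never become phantom reachable and the action would never run.
        Cleaner.Cleanable cleanable = CLEANER.register(resource, () -> freed.set(true));
        // clean() unregisters and runs the action at most once, deterministically;
        // otherwise the action runs after `resource` becomes unreachable.
        cleanable.clean();
        System.out.println(freed.get()); // prints true
    }
}
```

Note that unlike the old sun.misc.Cleaner.create(obj, action), the registration returns a Cleanable handle, which is what allows the explicit, deterministic clean() call shown in the diff's freeMemory path.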
[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531235#comment-16531235 ] Andrew Korzhuev commented on SPARK-24421: - If I understand this correctly, then the only deprecated JDK9+ API Spark is using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in `[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`, which is fixable in the following way: {code:java} @@ -22,7 +22,7 @@ import java.lang.reflect.Method; import java.nio.ByteBuffer; -import sun.misc.Cleaner; +import java.lang.ref.Cleaner; import sun.misc.Unsafe; public final class Platform { @@ -169,7 +169,8 @@ public static ByteBuffer allocateDirectBuffer(int size) { cleanerField.setAccessible(true); long memory = allocateMemory(size); ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size); - Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); + Cleaner cleaner = Cleaner.create(); + cleaner.register(buffer, () -> freeMemory(memory)); cleanerField.set(buffer, cleaner); return buffer; {code} [https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed] > sun.misc.Unsafe in JDK9+ > > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. 
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code}
[jira] [Commented] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504965#comment-16504965 ] Andrew Conegliano commented on SPARK-24481: --- Thanks Marco. Forgot to mention, this error doesn't happen in 2.0.2 and 2.2.0. And for 2.3.0, even though the error pops up, the code will still run because it disables wholestagecodegen to run it. The main problem is that in a spark streaming context, the error pops up for every message so logs fill disk very quickly. > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://;, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> 
expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > > Exception: > {code:java} > 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method > "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method > "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520) > at > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) > at com.google.common.cache.LocalCache.get(LocalCache.java:3932) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) > at
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Exception: {code:java} 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) at com.google.common.cache.LocalCache.get(LocalCache.java:3932) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1392) at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:167) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:164) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:61) at
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Exception: {code:java} 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB{code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and 
Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://;, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = >
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Environment: Emr 5.13.0 and Databricks Cloud 4.0 (was: Emr 5.13.0) > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://;, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > 
{code} > Log file is attached
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. 
Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. 
Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://;, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. 
Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300) a = a + s"when action like '$i%' THEN '$i' " a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. 
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Attachment: log4j-active(1).log > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log -- This 
message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
Andrew Conegliano created SPARK-24481: - Summary: GeneratedIteratorForCodegenStage1 grows beyond 64 KB Key: SPARK-24481 URL: https://issues.apache.org/jira/browse/SPARK-24481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: Emr 5.13.0 Reporter: Andrew Conegliano Attachments: log4j-active(1).log Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://;, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300) a = a + s"when action like '$i%' THEN '$i' " a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
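The repro above packs 300 WHEN branches into a single CASE expression, so code generation emits one huge method that blows past the JVM's 64 KB bytecode limit per method. One user-side mitigation (a sketch, not part of the reported code; `nestedCase` and the chunk size are illustrative choices) is to split the branches into nested CASE expressions so no single expression grows so large:

```scala
object CaseChunking {
  // Render one CASE expression over (pattern, result) branches, with an
  // arbitrary `else` expression to fall through to.
  private def caseExpr(branches: Seq[(String, String)], elseExpr: String): String =
    "case " +
      branches.map { case (p, r) => s"when action like '$p' then '$r'" }.mkString(" ") +
      s" else $elseExpr end"

  // Nest chunks of at most `size` branches; each chunk's `else` falls
  // through to the CASE built from the remaining chunks, so instead of one
  // 300-branch expression we get e.g. six nested 50-branch expressions.
  def nestedCase(branches: Seq[(String, String)], size: Int): String =
    branches.grouped(size).toSeq.reverse
      .foldLeft("null")((inner, chunk) => caseExpr(chunk, inner))

  def main(args: Array[String]): Unit = {
    val branches = (1 to 300).map(i => (s"$i%", i.toString))
    val expressionSql = nestedCase(branches, 50) + " as task_id"
    println(expressionSql.take(80)) // pass to expr(...) as in the repro
  }
}
```

Whether this keeps each generated function under the limit depends on the Spark version's codegen splitting; disabling whole-stage codegen (`spark.sql.codegen.wholeStage=false`) is the other commonly used mitigation for this family of errors.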
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683 ] Andrew Clegg edited comment on SPARK-22371 at 4/17/18 9:53 AM: --- Another data point -- I've seen this happen (in 2.3.0) during cleanup after a task failure: {code:none} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.lang.IllegalStateException: Attempted to access garbage collected accumulator 365 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 31 more {code} > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683 ] Andrew Clegg commented on SPARK-22371: -- Another data point -- I've seen this happen (in 2.3.0) during cleanup after a task failure: {code:none}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.lang.IllegalStateException: Attempted to access garbage collected accumulator 365
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 at scala.Option.foreach(Option.scala:257)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
 ... 31 more
{code} > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark Jobs are getting stuck on DAGScheduler.runJob as the dag-scheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation, it looks like the accumulator is cleaned by GC first, and > the same accumulator is then used for merging the results from the executor on the task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler event loop terminates with the below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
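The failure mode described above (an accumulator collected by GC while the scheduler still holds its id) follows from accumulators being registered only by weak reference. The toy registry below is an illustrative sketch of that mechanism, not Spark's actual `AccumulatorContext`; the names and the exception choice mirror the reported error but are assumptions:

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Toy stand-in for a registry that, like Spark's, holds only weak
// references: once the caller drops its strong reference, the entry can be
// garbage collected even though its id is still queued in a scheduler event.
object ToyRegistry {
  private val refs = mutable.Map.empty[Long, WeakReference[AnyRef]]

  def register(id: Long, acc: AnyRef): Unit =
    refs(id) = new WeakReference(acc)

  // Mirrors the observed error: looking up a cleared entry fails hard.
  def get(id: Long): AnyRef = {
    val acc = refs.get(id).map(_.get()).orNull
    if (acc == null)
      throw new IllegalStateException(
        s"Attempted to access garbage collected accumulator $id")
    acc
  }
}
```

A common user-side mitigation is to hold a strong reference to every accumulator for the lifetime of the job (e.g. in a driver-side collection), so the weak reference is never cleared while task-completion events are still in flight.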
[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099 ] Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:32 PM: - Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala] to get around Spark 2's restriction...halp! Don't make us do this! :) was (Author: ottomata): Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. 
This requires CHANGE COLUMN that alters the column type. In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. I believe this is fixable by adding an > exception in > [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] > to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and > destination type are both struct types, and the destination type only adds > new fields. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099 ] Andrew Otto commented on SPARK-23890: - Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala to get around Spark 2's restriction...halp! Don't make us do this! :) > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. This requires CHANGE COLUMN that alters the column type. In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. 
I believe this is fixable by adding an > exception in > [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] > to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and > destination type are both struct types, and the destination type only adds > new fields. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
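The workaround described in the comment above (going straight to HiveServer2 over JDBC so that Hive, not Spark, validates the DDL) can be sketched as follows. The connection URL, table, and column names are illustrative assumptions, not taken from the linked Gerrit patch:

```scala
import java.sql.DriverManager

object HiveJdbcDdl {
  // Build a Hive CHANGE COLUMN statement that "retypes" a struct column.
  // The old and new Hive types differ only by appended struct fields, so
  // Hive accepts it even though Spark SQL rejects it as a type change.
  def changeColumnDdl(table: String, col: String, newHiveType: String): String =
    s"ALTER TABLE $table CHANGE COLUMN $col $col $newHiveType"

  // Send the DDL directly to HiveServer2, bypassing Spark's DDL checks.
  // The URL is a placeholder; real deployments add auth/principal settings,
  // and the Hive JDBC driver must be on the classpath.
  def run(ddl: String, url: String = "jdbc:hive2://hive-server:10000/default"): Unit = {
    val conn = DriverManager.getConnection(url)
    try conn.createStatement().execute(ddl)
    finally conn.close()
  }
}
```

For example, appending `new_field` to an `event` struct would send something like `ALTER TABLE db.events CHANGE COLUMN event event struct<id:string,new_field:string>`.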
[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099 ] Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM: - Hah! As a temporary workaround, we are [[instantiating a JDBC connection to Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) was (Author: ottomata): Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. This requires CHANGE COLUMN that alters the column type. 
In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. I believe this is fixable by adding an > exception in > [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] > to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and > destination type are both struct types, and the destination type only adds > new fields. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099 ] Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM: - Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) was (Author: ottomata): Hah! As a temporary workaround, we are [[instantiating a JDBC connection to Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. 
In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. This requires CHANGE COLUMN that alters the column type. In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. I believe this is fixable by adding an > exception in > [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] > to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and > destination type are both struct types, and the destination type only adds > new fields. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099 ] Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM: - Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|http://example.com/] to get around Spark 2's restriction...halp! Don't make us do this! :) was (Author: ottomata): Hah! As a temporary workaround, we are [instantiating a JDBC connection to Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala to get around Spark 2's restriction...halp! Don't make us do this! :) > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. This requires CHANGE COLUMN that alters the column type. 
In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. I believe this is fixable by adding an > exception in > [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] > to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and > destination type are both struct types, and the destination type only adds > new fields. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
[ https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Otto updated SPARK-23890: External issue URL: https://github.com/apache/spark/pull/21012 > Hive ALTER TABLE CHANGE COLUMN for struct type no longer works > -- > > Key: SPARK-23890 > URL: https://issues.apache.org/jira/browse/SPARK-23890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Otto >Priority: Major > > As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE > CHANGE COLUMN commands to Hive. This restriction was loosened in > [https://github.com/apache/spark/pull/12714] to allow for those commands if > they only change the column comment. > Wikimedia has been evolving Parquet backed Hive tables with data originally > from JSON events by adding newly found columns to the Hive table schema, via > a Spark job we call 'Refine'. We do this by recursively merging an input > DataFrame schema with a Hive table DataFrame schema, finding new fields, and > then issuing an ALTER TABLE statement to add the columns. However, because > we allow for nested data types in the incoming JSON data, we make extensive > use of struct type fields. In order to add newly detected fields in a nested > data type, we must alter the struct column and append the nested struct > field. This requires CHANGE COLUMN that alters the column type. In reality, > the 'type' of the column is not changing, it just just a new field being > added to the struct, but to SQL, this looks like a type change. > We were about to upgrade to Spark 2 but this new restriction in SQL DDL that > can be sent to Hive will block us. 
> I believe this is fixable by adding an exception in [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and destination type are both struct types and the destination type only adds new fields.
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
Andrew Otto created SPARK-23890:
---
Summary: Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
Key: SPARK-23890
URL: https://issues.apache.org/jira/browse/SPARK-23890
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Otto

As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE CHANGE COLUMN commands to Hive. This was expanded in [https://github.com/apache/spark/pull/12714] to allow those commands if they only change the column comment.

Wikimedia has been evolving Parquet-backed Hive tables, with data originally coming from JSON data, by adding newly found columns to the Hive table schema via a Spark job we call 'Refine'. We do this by recursively merging an input DataFrame schema with a Hive table DataFrame schema, finding new fields, and then issuing an ALTER TABLE statement to add the columns. However, because we allow nested data types in the incoming JSON data, we make extensive use of struct type fields. In order to add newly detected fields in a nested data type, we must alter the struct column and append the nested struct field. This requires a CHANGE COLUMN that alters the column type. In reality, the 'type' of the column is not changing; it is just a new field being added to the struct, but to SQL this looks like a type change.

We were about to upgrade to Spark 2, but this new restriction on the SQL DDL that can be sent to Hive will block us. I believe this is fixable by adding an exception in [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325] to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and destination type are both struct types and the destination type only adds new fields.
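The recursive merge-and-alter approach the description sketches can be illustrated with a short, hypothetical model. To keep it self-contained it uses plain Python dicts instead of Spark StructType schemas, and the table and field names are invented for illustration; in the real 'Refine' job the same walk would run over DataFrame.schema fields.

```python
# Hypothetical sketch of the 'Refine' schema merge described above,
# using plain dicts in place of Spark StructTypes.

def merge_struct(table: dict, incoming: dict) -> dict:
    """Append fields that exist only in `incoming`; recurse into nested structs."""
    merged = dict(table)
    for name, dtype in incoming.items():
        if name not in merged:
            merged[name] = dtype                              # newly detected field
        elif isinstance(dtype, dict) and isinstance(merged[name], dict):
            merged[name] = merge_struct(merged[name], dtype)  # nested struct
    return merged

def hive_type(dtype) -> str:
    """Render a dict-schema as a Hive type string, e.g. struct<id:string>."""
    if isinstance(dtype, dict):
        return "struct<" + ",".join(f"{k}:{hive_type(v)}" for k, v in dtype.items()) + ">"
    return dtype

table_schema    = {"meta": {"id": "string"}}
incoming_schema = {"meta": {"id": "string", "dt": "bigint"}}

merged = merge_struct(table_schema, incoming_schema)
# Only a new nested field was appended, yet to SQL this is a "type change":
ddl = f"ALTER TABLE events CHANGE COLUMN meta meta {hive_type(merged['meta'])}"
print(ddl)  # ALTER TABLE events CHANGE COLUMN meta meta struct<id:string,dt:bigint>
```

The last line is the crux of the ticket: the generated DDL only appends `dt` to the struct, but it has to be expressed as a CHANGE COLUMN with a new column type, which the restriction blocks.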
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428517#comment-16428517 ]
Andrew Davidson commented on SPARK-23878:
-
yes please mark resolved

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
> Reporter: Andrew Davidson
> Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
> I have some code I am moving from a jupyter notebook to separate python modules. My notebook uses col() and lit() and works fine.
> When I try to work with module files in my IDE I get the following errors. I am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit functions?
> I found a description of the problem at [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
> I do not understand how to make this work in my IDE. I am not running pyspark, just an editor.
> Is there some sort of workaround or replacement for these missing functions?
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428509#comment-16428509 ]
Andrew Davidson commented on SPARK-23878:
-
added screen shot
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428494#comment-16428494 ]
Andrew Davidson commented on SPARK-23878:
-
Hi Hyukjin, many thanks! I am not a python expert. I did not know about dynamic name spaces. Did a little googling. To configure dynamic namespaces in eclipse: preferences -> pydev -> interpreter -> python interpreters, then add pyspark to 'forced builtins'. I attached a screen shot. Any idea how this can be added to the documentation so others do not have to waste time on this detail?

p.s. you sent me a link to "support virtualenv in PySpark" https://issues.apache.org/jira/browse/SPARK-13587. Once I made the pydev configuration change I am able to use pyspark in a virtualenv. I use this environment in the eclipse pydev IDE without any problems. I am able to run Jupyter notebooks without a problem.
[jira] [Updated] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Davidson updated SPARK-23878:
Attachment: eclipsePyDevPySparkConfig.png
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427883#comment-16427883 ]
Andrew Davidson commented on SPARK-23878:
-
Hi Hyukjin,

you are correct. Most IDEs are primarily language-aware editors and builders. For example, consider eclipse or IntelliJ for developing a javascript website or a java servlet. The editor functionality knows about the syntax of the language you are working with, along with the libraries and packages you are using. Often the IDE does some sort of continuous build or code analysis to help you find bugs without having to deploy. Often the IDE also makes it easy to build, package, and actually deploy on some sort of test server, debug, and run unit tests.

So if pyspark is generating functions at run time, that is going to cause problems for the IDE: the functions are not defined in the edit session.

[http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext] describes how to write unit tests for pyspark that you can run from your command line and/or from within eclipse. I think a side effect is that they might cause the functions lit() and col() to be generated?

I could not find a workaround for col() and lit():

ret = df.select(
    col(columnName).cast("string").alias("key"),
    lit(value).alias("source")
)

Kind regards

Andy
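For what it is worth, the "functions generated at run time" theory in the comment above matches how older pyspark versions built pyspark.sql.functions: names like col and lit were attached to the module dynamically, so a static analyzer finds no definition even though run-time lookup succeeds. A stdlib-only sketch of that situation (the stand-in module below is hypothetical, not pyspark's actual code):

```python
import types

# Stand-in for a module that creates its functions at import time, the way
# older pyspark.sql.functions attached col()/lit() dynamically. A static
# analyzer scanning this source sees no `col` definition at all.
functions = types.ModuleType("functions")
for _name in ("col", "lit"):
    # each generated function just records its call, for illustration only
    setattr(functions, _name, lambda x, _n=_name: f"{_n}({x!r})")

# Run-time attribute lookup works fine, which is why notebooks are unaffected:
col = getattr(functions, "col")
lit = getattr(functions, "lit")
print(col("testId"), lit(42))  # col('testId') lit(42)
```

In practice the editor-side fixes are the ones discussed in this thread: register pyspark under PyDev's forced builtins, or import the module object (e.g. `from pyspark.sql import functions as F`, then `F.col(...)`), which most analyzers can resolve even when the attributes are generated.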
[jira] [Created] (SPARK-23878) unable to import col() or lit()
Andrew Davidson created SPARK-23878:
---
Summary: unable to import col() or lit()
Key: SPARK-23878
URL: https://issues.apache.org/jira/browse/SPARK-23878
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.0
Environment: eclipse 4.7.3
pyDev 6.3.2
pyspark==2.3.0
Reporter: Andrew Davidson

I have some code I am moving from a jupyter notebook to separate python modules. My notebook uses col() and lit() and works fine.
When I try to work with module files in my IDE I get the following errors. I am also not able to run my unit tests.
{color:#FF}Description Resource Path Location Type{color}
{color:#FF}Unresolved import: lit load.py /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
{color:#FF}Description Resource Path Location Type{color}
{color:#FF}Unresolved import: col load.py /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
I suspect that when you run pyspark it is generating the col and lit functions?
I found a description of the problem at [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
I do not understand how to make this work in my IDE. I am not running pyspark, just an editor.
Is there some sort of workaround or replacement for these missing functions?
[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474 ]
Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 3:00 PM:
--
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
* AWS S3 checkpoint
* Spark 2.3.0 on k8s
* Structured stream - stream join

I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]
{code}
which appears not to clean up the _UnsafeRow_ entries coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]
{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.fs.s3a.fast.upload.buffer disk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely.

Before worker dies: !screen_shot_2018-03-20_at_15.23.29.png!
Heap dump of worker running for some time: !Screen Shot 2018-03-28 at 16.44.20.png!

was (Author: akorzhuev): I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory.
Tested on:
* AWS S3 checkpoint
* Spark 2.3.0 on k8s
* Structured stream - stream join

I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]
{code}
which appears not to clean up the _UnsafeRow_ entries coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]
{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.fs.s3a.fast.upload.buffer disk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> --
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
> Reporter: Yuriy Bondaruk
> Priority: Major
> Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, screen_shot_2018-03-20_at_15.23.29.png
>
> It seems like there is an issue with memory in structured streaming. A stream with aggregation (dropDuplicates()) and data partitioning constantly increases memory usage and finally executors fail with exit code 137:
>
[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Korzhuev updated SPARK-23682: Attachment: screen_shot_2018-03-20_at_15.23.29.png > Memory issue with Spark structured streaming > > > Key: SPARK-23682 > URL: https://issues.apache.org/jira/browse/SPARK-23682 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 > Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3 > |spark.blacklist.decommissioning.enabled|true| > |spark.blacklist.decommissioning.timeout|1h| > |spark.cleaner.periodicGC.interval|10min| > |spark.default.parallelism|18| > |spark.dynamicAllocation.enabled|false| > |spark.eventLog.enabled|true| > |spark.executor.cores|3| > |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 > -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'| > |spark.executor.id|driver| > |spark.executor.instances|3| > |spark.executor.memory|22G| > |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2| > |spark.hadoop.parquet.enable.summary-metadata|false| > |spark.hadoop.yarn.timeline-service.enabled|false| > |spark.jars| | > |spark.master|yarn| > |spark.memory.fraction|0.9| > |spark.memory.storageFraction|0.3| > |spark.memory.useLegacyMode|false| > |spark.rdd.compress|true| > |spark.resourceManager.cleanupExpiredHost|true| > |spark.scheduler.mode|FIFO| > |spark.serializer|org.apache.spark.serializer.KryoSerializer| > |spark.shuffle.service.enabled|true| > |spark.speculation|false| > |spark.sql.parquet.filterPushdown|true| > |spark.sql.parquet.mergeSchema|false| > |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse| > |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true| > |spark.submit.deployMode|client| > |spark.yarn.am.cores|1| > |spark.yarn.am.memory|2G| > |spark.yarn.am.memoryOverhead|1G| > 
|spark.yarn.executor.memoryOverhead|3G| >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Memory, memory, memory-leak > Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot > 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen > Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, > Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, > screen_shot_2018-03-20_at_15.23.29.png > > > It seems like there is an issue with memory in structured streaming. A stream > with aggregation (dropDuplicates()) and data partitioning constantly > increases memory usage and finally executors fails with exit code 137: > {quote}ExecutorLostFailure (executor 2 exited caused by one of the running > tasks) Reason: Container marked as failed: > container_1520214726510_0001_01_03 on host: > ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: > Container killed on request. Exit code is 137 > Container exited with a non-zero exit code 137 > Killed by external signal{quote} > Stream creating looks something like this: > {quote}session > .readStream() > .schema(inputSchema) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv("s3://test-bucket/input") > .as(Encoders.bean(TestRecord.class)) > .flatMap(mf, Encoders.bean(TestRecord.class)) > .dropDuplicates("testId", "testName") > .withColumn("year", > functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), > "")) > .writeStream() > .option("path", "s3://test-bucket/output") > .option("checkpointLocation", "s3://test-bucket/checkpoint") > .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS)) > .partitionBy("year") > .format("parquet") > .outputMode(OutputMode.Append()) > .queryName("test-stream") > .start();{quote} > Analyzing the heap dump I found that most of the memory used by > {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}} > that 
is referenced from > [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196] > > At first glance this looks normal, since that is how Spark keeps aggregation > keys in memory. However, I did my testing by renaming files in the source folder > so that they would be picked up by Spark again. Since the input records are the > same, all further rows should be rejected as duplicates and memory consumption > shouldn't increase, but that is not the case. Moreover, GC time took more than 30% of > total processing time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
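For context on why that state grows, here is a minimal plain-Python sketch (illustrative only, not Spark API; the function names are invented) of the difference between deduplication state that is never evicted and state bounded by an event-time watermark, which is the mechanism Structured Streaming uses to cap `dropDuplicates()` state:

```python
def dedup_unbounded(records):
    """Remember every key forever: state grows with the number of distinct keys."""
    seen = set()
    out = []
    for r in records:
        if r["key"] not in seen:
            seen.add(r["key"])
            out.append(r)
    return out, seen

def dedup_with_watermark(records, watermark_delay):
    """Evict keys whose event time falls behind the watermark.

    State stays bounded, at the cost of possibly re-admitting a key that
    arrives much later than the watermark allows.
    """
    seen = {}            # key -> latest event time seen for that key
    max_event_time = 0
    out = []
    for r in records:
        max_event_time = max(max_event_time, r["time"])
        watermark = max_event_time - watermark_delay
        # Evict expired keys so the state map stays bounded.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if r["key"] not in seen:
            out.append(r)
        seen[r["key"]] = r["time"]
    return out, seen
```

Without a watermark (as in the query above, where `dropDuplicates("testId", "testName")` has no `withWatermark` before it), the first behavior applies and the state store can only grow.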
[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474 ] Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:56 PM: -- I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream - stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the total volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png!
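The `loadedMaps` map above is keyed by batch version, so it gains an entry per micro-batch unless old versions are evicted; later Spark releases cap the number of retained in-memory state versions (configurable via `spark.sql.streaming.minBatchesToRetain`). A minimal sketch of that eviction idea in plain Python (hypothetical names, not Spark's actual implementation):

```python
from collections import OrderedDict

class BoundedVersionStore:
    """Sketch of the fix idea: keep only the most recent N batch versions
    in memory instead of every version ever loaded."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.loaded_maps = OrderedDict()   # batch version -> state map

    def put(self, version, state_map):
        self.loaded_maps[version] = state_map
        self.loaded_maps.move_to_end(version)       # mark as most recent
        while len(self.loaded_maps) > self.max_versions:
            self.loaded_maps.popitem(last=False)    # evict oldest version
```

With an unbounded `HashMap[Long, MapType]` as in the code referenced above, the `while` loop is effectively absent, so memory grows by one state-map snapshot per micro-batch.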
[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Korzhuev updated SPARK-23682: Attachment: Screen Shot 2018-03-28 at 16.44.20.png
[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images
[ https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409609#comment-16409609 ] Andrew Korzhuev commented on SPARK-22865: - Here is an example of how this can be done with Travis CI and prebuilt Spark binaries: [https://github.com/andrusha/spark-k8s-docker] Published here: [https://hub.docker.com/r/andrusha/spark-k8s/] > Publish Official Apache Spark Docker images > --- > > Key: SPARK-22865 > URL: https://issues.apache.org/jira/browse/SPARK-22865 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org