[jira] [Updated] (SPARK-23890) Support DDL for adding nested columns to struct types

2023-12-21 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

Description: 
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
from JSON events by adding newly found columns to the Hive table schema, via a 
Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires a CHANGE COLUMN statement that alters the column type.  In reality, the 'type' of 
the column is not changing; it is just a new field being added to the struct, 
but to SQL, this looks like a type change.
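
For illustration, here is a minimal sketch of the kind of recursive schema diff described above (a hypothetical helper, not the actual Refine code; field matching is simplified to case-insensitive names):

{code:python}
# Hypothetical sketch, not the actual Refine job: recursively find field paths
# that exist in the incoming DataFrame schema but not in the Hive table schema.
from pyspark.sql.types import StructType


def missing_fields(incoming: StructType, table: StructType, prefix: str = "") -> list:
    existing = {f.name.lower(): f for f in table.fields}
    missing = []
    for field in incoming.fields:
        current = existing.get(field.name.lower())
        if current is None:
            # Entirely new field (possibly a whole new struct).
            missing.append(prefix + field.name)
        elif isinstance(field.dataType, StructType) and isinstance(current.dataType, StructType):
            # Field exists on both sides and both are structs: recurse for new sub-fields.
            missing.extend(
                missing_fields(field.dataType, current.dataType, prefix + field.name + "."))
    return missing
{code}

Each dotted path returned for a struct member is what then requires altering the parent struct column.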

-We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.-

 

In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
Spark 3 datasource v2 would support this.

However, it is clear that it does not.  There is an [explicit 
check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
 and 
[test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
 that prevents this from happening.

This can be done via {{{}ALTER TABLE ADD COLUMN nested1.new_field1{}}}, but this 
is not supported for any datasource v1 sources.
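
For example (table and field names below are hypothetical), the statement that is accepted for a DataSource v2 table but rejected for v1 sources looks like this:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table; accepted only when the table is backed by a DataSource v2
# catalog implementation, rejected by the analyzer for v1 sources (Hive/Parquet/ORC/JSON).
spark.sql("ALTER TABLE db.events ADD COLUMN nested1.new_field1 STRING")
{code}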

 

 

 

 

  was:
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
from JSON events by adding newly found columns to the Hive table schema, via a 
Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing; it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

-We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.-

 

In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
Spark 3 datasource v2 would support this.

However, it is clear that it does not.  There is an [explicit 
check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
 and 
[test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
 that prevents this from happening.

 

 


> Support DDL for adding nested columns to struct types
> -
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: 

[jira] [Updated] (SPARK-23890) Support DDL for adding nested columns to struct types

2023-12-21 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

Summary: Support DDL for adding nested columns to struct types  (was: Hive 
ALTER TABLE CHANGE COLUMN for struct type no longer works)

> Support DDL for adding nested columns to struct types
> -
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: Major
>  Labels: bulk-closed, pull-request-available
> Fix For: 3.0.0
>
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing; it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> -We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.-
>  
> In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
> Spark 3 datasource v2 would support this.
> However, it is clear that it does not.  There is an [explicit 
> check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
>  and 
> [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
>  that prevents this from happening.
>  
>  






[jira] [Reopened] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2023-12-21 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto reopened SPARK-23890:
-

This is fixed for DataSource v2 via {{{}alter table add column 
nested.new_field0{}}}, but apparently only a few data sources use the 
DataSource v2 code path.  Iceberg, for example, works, but Hive, Parquet, ORC, 
and JSON still go through the DataSource v1 code path when checking whether this is allowed.

Reopening and retitling more generically to allow nested column addition for 
Parquet, etc.

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: Major
>  Labels: bulk-closed, pull-request-available
> Fix For: 3.0.0
>
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing; it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> -We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.-
>  
> In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
> Spark 3 datasource v2 would support this.
> However, it is clear that it does not.  There is an [explicit 
> check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
>  and 
> [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
>  that prevents this from happening.
>  
>  






[jira] [Resolved] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2023-12-15 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto resolved SPARK-23890.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Ah! This is supported in DataSource v2 after all, just not via CHANGE COLUMN.  
Instead, you can add a field to a nested struct by addressing it with 
dotted notation:

 
ALTER TABLE otto.test_table03 ADD COLUMN s1.s1_f2_added STRING;
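
The same thing from PySpark, for reference (table and field names as in the statement above; as noted in the later reopened comment, this only works when the table goes through the DataSource v2 code path):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Add a new field to the nested struct s1 using dotted notation, then verify.
spark.sql("ALTER TABLE otto.test_table03 ADD COLUMN s1.s1_f2_added STRING")
spark.table("otto.test_table03").printSchema()
{code}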

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: Major
>  Labels: bulk-closed, pull-request-available
> Fix For: 3.0.0
>
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing; it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> -We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.-
>  
> In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
> Spark 3 datasource v2 would support this.
> However, it is clear that it does not.  There is an [explicit 
> check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
>  and 
> [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
>  that prevents this from happening.
>  
>  






[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2023-12-14 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

 Shepherd: Max Gekk
Affects Version/s: 3.0.0
  Description: 
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
from JSON events by adding newly found columns to the Hive table schema, via a 
Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing; it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

-We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.-

 

In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
Spark 3 datasource v2 would support this.

However, it is clear that it does not.  There is an [explicit 
check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
 and 
[test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
 that prevents this from happening.

 

 

  was:
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
from JSON events by adding newly found columns to the Hive table schema, via a 
Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing; it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.

 

 


> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: Major
>  Labels: bulk-closed, pull-request-available
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a 

[jira] [Reopened] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2023-12-14 Thread Andrew Otto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto reopened SPARK-23890:
-

This was supposed to have been fixed in Spark 3 datasource v2, but the issue 
persists.

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Andrew Otto
>Priority: Major
>  Labels: bulk-closed, pull-request-available
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing; it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> -We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.-
>  
> In this [PR|https://github.com/apache/spark/pull/21012], I was told that the 
> Spark 3 datasource v2 would support this.
> However, it is clear that it does not.  There is an [explicit 
> check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
>  and 
> [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
>  that prevents this from happening.
>  
>  






[jira] [Comment Edited] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716135#comment-17716135
 ] 

Andrew Grigorev edited comment on SPARK-43273 at 4/25/23 12:56 PM:
---

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by 
default for their Parquet output format :).

https://github.com/ClickHouse/ClickHouse/issues/49141


was (Author: ei-grad):
Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by 
default for their Parquet output format :).

> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> hadoop-parquet version should be updated to 1.3.0 (together with other 
> parquet-mr libs)
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}






[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43273:

Description: 
hadoop-parquet version should be updated to 1.3.0 (together with other 
parquet-mr libs)


{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}
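
A minimal way to trigger this from PySpark, assuming you already have a Parquet file written with the LZ4_RAW codec (the path below is hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to a file produced by a writer that uses the LZ4_RAW codec
# (e.g. recent ClickHouse / parquet-cpp output). On affected Spark versions the
# read fails with the "No enum constant ... LZ4_RAW" error shown above.
spark.read.parquet("/tmp/lz4_raw_example.parquet").show()
{code}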

  was:
hadoop-parquet version should be updated to 1.3.0

 
{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}


> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> hadoop-parquet version should be updated to 1.3.0 (together with other 
> parquet-mr libs)
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}






[jira] [Commented] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716135#comment-17716135
 ] 

Andrew Grigorev commented on SPARK-43273:
-

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by 
default for their Parquet output format :).

> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> hadoop-parquet version should be updated to 1.3.0
>  
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}






[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43273:

Description: 
hadoop-parquet version should be updated to 1.3.0

 
{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}

  was:hadoop-parquet version should be updated to 1.3.0


> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> hadoop-parquet version should be updated to 1.3.0
>  
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}






[jira] [Created] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)
Andrew Grigorev created SPARK-43273:
---

 Summary: Spark can't read parquet files with a newer LZ4_RAW 
compression
 Key: SPARK-43273
 URL: https://issues.apache.org/jira/browse/SPARK-43273
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0, 3.3.2, 3.2.4, 3.3.3
Reporter: Andrew Grigorev


hadoop-parquet version should be updated to 1.3.0






[jira] [Updated] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"

2023-04-19 Thread Andrew Grigorev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43189:

Description: 
h2. Issue

Users who have mypy enabled in their IDE or CI environment face very verbose 
error messages when using the {{pandas_udf}} function in PySpark. The current 
typing of the {{pandas_udf}} function seems to be causing these issues. As a 
workaround, the official documentation provides examples that use {{{}# type: 
ignore[call-overload]{}}}, but this is not an ideal solution.
h2. Example

Here's a code snippet taken from 
[docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs]
 that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("col1 string, col2 long")
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3['col2'] = s1 + s2.str.len()
    return s3
{code}
Running mypy on this code results in a long and verbose error message, which 
makes it difficult for users to understand the actual issue and how to resolve 
it.
h2. Proposed Solution

We kindly request the PySpark development team to review and improve the typing 
for the {{pandas_udf}} function to prevent these verbose error messages from 
appearing. This improvement will help users who have mypy enabled in their 
development environments to have a better experience when using PySpark.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
h2. Impact

By addressing this issue, users of PySpark with mypy enabled in their 
development environment will be able to write and verify their code more 
efficiently, without being overwhelmed by verbose error messages. This will 
lead to a more enjoyable and productive experience when working with PySpark 
and pandas UDFs.
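
For reference, the workaround the documentation currently leans on looks like this (same UDF as above, with the error suppressed rather than fixed):

{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

# Current workaround: suppress mypy's overload-resolution error on the decorator.
@pandas_udf("col1 string, col2 long")  # type: ignore[call-overload]
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3['col2'] = s1 + s2.str.len()
    return s3
{code}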

  was:
h2. Issue

Users who have mypy enabled in their IDE or CI environment face very verbose 
error messages when using the {{pandas_udf}} function in PySpark. The current 
typing of the {{pandas_udf}} function seems to be causing these issues. As a 
workaround, the official documentation provides examples that use {{{}# type: 
ignore[call-overload]{}}}, but this is not an ideal solution.
h2. Example

Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"]*len(s), index=s.index)
{code}
Running mypy on this code results in a long and verbose error message, which 
makes it difficult for users to understand the actual issue and how to resolve 
it.
h2. Proposed Solution

We kindly request the PySpark development team to review and improve the typing 
for the {{pandas_udf}} function to prevent these verbose error messages from 
appearing. This improvement will help users who have mypy enabled in their 
development environments to have a better experience when using PySpark.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
h2. Impact

By addressing this issue, users of PySpark with mypy enabled in their 
development environment will be able to write and verify their code more 
efficiently, without being overwhelmed by verbose error messages. This will 
lead to a more enjoyable and productive experience when working with PySpark 
and pandas UDFs.


> No overload variant of "pandas_udf" matches argument type "str"
> ---
>
> Key: SPARK-43189
> URL: https://issues.apache.org/jira/browse/SPARK-43189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Major
>
> h2. Issue
> Users who have mypy enabled in their IDE or CI environment face very verbose 
> error messages when using the {{pandas_udf}} function in PySpark. The current 
> typing of the {{pandas_udf}} function seems to be causing these issues. As a 
> workaround, the official documentation provides examples that use {{{}# type: 
> ignore[call-overload]{}}}, but this is not an ideal solution.
> h2. Example
> Here's a code snippet taken from 
> [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs]
>  that triggers the error when mypy is enabled:
> {code:python}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> @pandas_udf("col1 string, col2 long")
> def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> 

[jira] [Created] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"

2023-04-19 Thread Andrew Grigorev (Jira)
Andrew Grigorev created SPARK-43189:
---

 Summary: No overload variant of "pandas_udf" matches argument type 
"str"
 Key: SPARK-43189
 URL: https://issues.apache.org/jira/browse/SPARK-43189
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0, 3.3.2, 3.2.4
Reporter: Andrew Grigorev


h2. Issue

Users who have mypy enabled in their IDE or CI environment face very verbose 
error messages when using the {{pandas_udf}} function in PySpark. The current 
typing of the {{pandas_udf}} function seems to be causing these issues. As a 
workaround, the official documentation provides examples that use {{{}# type: 
ignore[call-overload]{}}}, but this is not an ideal solution.
h2. Example

Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"]*len(s), index=s.index)
{code}
Running mypy on this code results in a long and verbose error message, which 
makes it difficult for users to understand the actual issue and how to resolve 
it.
h2. Proposed Solution

We kindly request the PySpark development team to review and improve the typing 
for the {{pandas_udf}} function to prevent these verbose error messages from 
appearing. This improvement will help users who have mypy enabled in their 
development environments to have a better experience when using PySpark.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
h2. Impact

By addressing this issue, users of PySpark with mypy enabled in their 
development environment will be able to write and verify their code more 
efficiently, without being overwhelmed by verbose error messages. This will 
lead to a more enjoyable and productive experience when working with PySpark 
and pandas UDFs.






[jira] [Created] (SPARK-39897) StackOverflowError in TaskMemoryManager

2022-07-27 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39897:
--

 Summary: StackOverflowError in TaskMemoryManager
 Key: SPARK-39897
 URL: https://issues.apache.org/jira/browse/SPARK-39897
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.7
Reporter: Andrew Ray


I have observed the following error, which looks to stem from 
TaskMemoryManager.allocatePage making a recursive call to itself when a page 
cannot be allocated. I'm observing this in Spark 2.4, but since the relevant 
code is still the same in master, this is likely still a potential point of 
failure in current versions. Prioritizing this as minor, as it looks to be a 
very uncommon outcome and I cannot find any other reports of a similar nature.
{code:java}
Py4JJavaError: An error occurred while calling o625.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StackOverflowError
at 
java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
at 
java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535)
at java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:457)
at java.lang.ClassLoader.loadClass(ClassLoader.java:398)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:512)
at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2657)
at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:899)
at sun.util.resources.LocaleData$1.run(LocaleData.java:167)

[jira] [Created] (SPARK-39883) Add DataFrame function parity check

2022-07-26 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39883:
--

 Summary: Add DataFrame function parity check
 Key: SPARK-39883
 URL: https://issues.apache.org/jira/browse/SPARK-39883
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Andrew Ray


Add a test that compares the available list of DataFrame functions in 
org.apache.spark.sql.functions with the SQL function registry.
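
As a rough illustration of the intended comparison (sketched here from the Python side, since pyspark.sql.functions mirrors the Scala API; the real test would live in the Scala test suite):

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Names exposed by the DataFrame function API.
dataframe_funcs = {name.lower() for name in dir(F)
                   if not name.startswith("_") and callable(getattr(F, name))}

# Names registered in the SQL function registry.
registry_funcs = {row.function.lower() for row in spark.sql("SHOW FUNCTIONS").collect()}

print("In registry but not DataFrame API:", sorted(registry_funcs - dataframe_funcs)[:20])
print("In DataFrame API but not registry:", sorted(dataframe_funcs - registry_funcs)[:20])
{code}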






[jira] [Created] (SPARK-39734) Add call_udf to pyspark.sql.functions

2022-07-10 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39734:
--

 Summary: Add call_udf to pyspark.sql.functions
 Key: SPARK-39734
 URL: https://issues.apache.org/jira/browse/SPARK-39734
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Andrew Ray


Add the call_udf function to PySpark for parity with the Scala/Java function 
org.apache.spark.sql.functions#call_udf
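
For illustration of the intended parity: today a registered UDF can be invoked from Python via expr(), and the proposed call_udf would be the direct equivalent of the Scala/Java helper (the UDF name below is hypothetical):

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical UDF registered by name, the form call_udf expects.
spark.udf.register("plus_one", lambda x: x + 1, "int")

# Works today via expr(); the proposed functions.call_udf("plus_one", F.col("id"))
# would mirror org.apache.spark.sql.functions#call_udf.
spark.range(3).select(F.expr("plus_one(id)").alias("out")).show()
{code}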






[jira] [Created] (SPARK-39733) Add map_contains_key to pyspark.sql.functions

2022-07-10 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39733:
--

 Summary: Add map_contains_key to pyspark.sql.functions
 Key: SPARK-39733
 URL: https://issues.apache.org/jira/browse/SPARK-39733
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Andrew Ray


SPARK-37584 added the function map_contains_key to SQL and Scala/Java 
functions. This JIRA is to track its addition to the PySpark function set for 
parity.
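
Until the Python wrapper exists, the SQL function added by SPARK-37584 can already be reached from PySpark via expr(); the requested functions.map_contains_key would wrap exactly this (sketch below, column names are made up):

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A single map-typed column named "m" (names here are just for illustration).
df = spark.createDataFrame([({"a": 1},)], ["m"])

# Reachable today through the SQL expression; the proposed Python function
# would be the direct wrapper around it.
df.select(F.expr("map_contains_key(m, 'a')").alias("has_a")).show()
{code}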






[jira] [Updated] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's

2022-07-09 Thread Andrew Ray (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ray updated SPARK-39728:
---
Priority: Minor  (was: Major)

> Test for parity of SQL functions between Python and JVM DataFrame API's
> ---
>
> Key: SPARK-39728
> URL: https://issues.apache.org/jira/browse/SPARK-39728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Andrew Ray
>Priority: Minor
>
> Add a unit test that compares the available list of Python DataFrame 
> functions in pyspark.sql.functions with those available in the Scala/Java 
> DataFrame API in org.apache.spark.sql.functions.






[jira] [Created] (SPARK-39728) Test for parity of SQL functions between Python and JVM DataFrame API's

2022-07-09 Thread Andrew Ray (Jira)
Andrew Ray created SPARK-39728:
--

 Summary: Test for parity of SQL functions between Python and JVM 
DataFrame API's
 Key: SPARK-39728
 URL: https://issues.apache.org/jira/browse/SPARK-39728
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 3.3.0
Reporter: Andrew Ray


Add a unit test that compares the available list of Python DataFrame functions 
in pyspark.sql.functions with those available in the Scala/Java DataFrame API 
in org.apache.spark.sql.functions.
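
One possible shape for such a check from the Python side, using py4j reflection to list the JVM API (a sketch of the idea, not the actual test; exclusion lists for deliberate differences are omitted):

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Names exposed by the Python DataFrame function API.
python_funcs = {name for name in dir(F)
                if not name.startswith("_") and callable(getattr(F, name))}

# Names exposed by org.apache.spark.sql.functions, listed via Java reflection over py4j.
jvm_functions = spark.sparkContext._jvm.java.lang.Class.forName("org.apache.spark.sql.functions")
jvm_funcs = {m.getName() for m in jvm_functions.getMethods()}

print("JVM-only functions:", sorted(jvm_funcs - python_funcs)[:20])
print("Python-only functions:", sorted(python_funcs - jvm_funcs)[:20])
{code}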






[jira] [Created] (SPARK-38618) Implement JDBCDataSourceV2

2022-03-21 Thread Andrew Murphy (Jira)
Andrew Murphy created SPARK-38618:
-

 Summary: Implement JDBCDataSourceV2
 Key: SPARK-38618
 URL: https://issues.apache.org/jira/browse/SPARK-38618
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.1
Reporter: Andrew Murphy









[jira] [Commented] (SPARK-38288) Aggregate push down doesnt work using Spark SQL jdbc datasource with postgresql

2022-02-27 Thread Andrew Murphy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498668#comment-17498668
 ] 

Andrew Murphy commented on SPARK-38288:
---

Hi [~llozano] I believe this is because JDBC DataSource V2 has not been fully 
implemented. Even though [29695|https://github.com/apache/spark/pull/29695] has 
merged, reading from a JDBC connection still defaults to JDBC DataSource V1.

> Aggregate push down doesnt work using Spark SQL jdbc datasource with 
> postgresql
> ---
>
> Key: SPARK-38288
> URL: https://issues.apache.org/jira/browse/SPARK-38288
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Luis Lozano Coira
>Priority: Major
>  Labels: DataSource, Spark-SQL
>
> I am establishing a connection with postgresql using the Spark SQL jdbc 
> datasource. I have started the spark shell including the postgres driver and 
> I can connect and execute queries without problems. I am using this statement:
> {code:java}
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:postgresql://host:port/").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").option("user", 
> "postgres").option("password", 
> "***").option("pushDownAggregate",true).load()
> {code}
> I am adding the pushDownAggregate option because I would like the 
> aggregations are delegated to the source. But for some reason this is not 
> happening.
> Reviewing this pull request, it seems that this feature should be merged into 
> 3.2. [https://github.com/apache/spark/pull/29695]
> I am making the aggregations considering the mentioned limitations. An 
> example case where I don't see pushdown being done would be this one:
> {code:java}
> df.groupBy("name").max("age").show()
> {code}
> The results of the queryExecution are shown below:
> {code:java}
> scala> df.groupBy("name").max("age").queryExecution.executedPlan
> res19: org.apache.spark.sql.execution.SparkPlan =
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, 
> max(age)#544])
>+- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205]
>   +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], 
> output=[name#274, max#548])
>  +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] 
> PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: 
> struct
> scala> dfp.groupBy("name").max("age").queryExecution.toString
> res20: String =
> "== Parsed Logical Plan ==
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#246] JDBCRelation(test) [numPartitions=1]
> == Analyzed Logical Plan ==
> name: string, max(age): int
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#24...
> {code}
> What could be the problem? Should pushDownAggregate work in this case?






[jira] [Updated] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry

2022-02-22 Thread Andrew Murphy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Murphy updated SPARK-38296:
--
Description: While trying to migrate invalidLiteralForWindowDurationError 
to use Error Classes (SPARK-38110), I realized that the error class is not 
propagating correctly. This is because the AnalysisException is caught and 
rethrown at 
[FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154]
 without a case for Error Classes. I will attach a fix for this specific error 
here to avoid bundling too many changes into one PR for SPARK-38110. I 
anticipate that similar changes will need to be made where AnalysisException is 
rethrown as the migration continues.  (was: While trying to migrate 
invalidLiteralForWindowDurationError to use Error Classes (SPARK-38110), I 
realized that the error class is not propagating correctly. This is because the 
AnalysisException is caught and rethrown at 
[FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154]
 without a case for Error Classes. I will attach a fix for this specific error 
here to avoid bundling too many changes into one PR. I anticipate that similar 
changes will need to be made where AnalysisException is rethrown as the 
migration continues.)

> Error Classes AnalysisException is not propagated in FunctionRegistry
> -
>
> Key: SPARK-38296
> URL: https://issues.apache.org/jira/browse/SPARK-38296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Andrew Murphy
>Priority: Major
>
> While trying to migrate invalidLiteralForWindowDurationError to use Error 
> Classes (SPARK-38110), I realized that the error class is not propagating 
> correctly. This is because the AnalysisException is caught and rethrown at 
> [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154]
>  without a case for Error Classes. I will attach a fix for this specific 
> error here to avoid bundling too many changes into one PR for SPARK-38110. I 
> anticipate that similar changes will need to be made where AnalysisException 
> is rethrown as the migration continues.






[jira] [Updated] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry

2022-02-22 Thread Andrew Murphy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Murphy updated SPARK-38296:
--
 External issue ID:   (was: SPARK-38110)
External issue URL:   (was: 
https://issues.apache.org/jira/browse/SPARK-38110)

> Error Classes AnalysisException is not propagated in FunctionRegistry
> -
>
> Key: SPARK-38296
> URL: https://issues.apache.org/jira/browse/SPARK-38296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Andrew Murphy
>Priority: Major
>
> While trying to migrate invalidLiteralForWindowDurationError to use Error 
> Classes (SPARK-38110), I realized that the error class is not propagating 
> correctly. This is because the AnalysisException is caught and rethrown at 
> [FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154]
>  without a case for Error Classes. I will attach a fix for this specific 
> error here to avoid bundling too many changes into one PR. I anticipate that 
> similar changes will need to be made where AnalysisException is rethrown as 
> the migration continues.






[jira] [Created] (SPARK-38296) Error Classes AnalysisException is not propagated in FunctionRegistry

2022-02-22 Thread Andrew Murphy (Jira)
Andrew Murphy created SPARK-38296:
-

 Summary: Error Classes AnalysisException is not propagated in 
FunctionRegistry
 Key: SPARK-38296
 URL: https://issues.apache.org/jira/browse/SPARK-38296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Andrew Murphy


While trying to migrate invalidLiteralForWindowDurationError to use Error 
Classes (SPARK-38110), I realized that the error class is not propagating 
correctly. This is because the AnalysisException is caught and rethrown at 
[FunctionRegistry:154|https://github.com/apache/spark/blob/43e93b581ea5f7a1ba6cf943e6624f6847ebc3a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L154]
 without a case for Error Classes. I will attach a fix for this specific error 
here to avoid bundling too many changes into one PR. I anticipate that similar 
changes will need to be made where AnalysisException is rethrown as the 
migration continues.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37954) old columns should not be available after select or drop

2022-02-10 Thread Andrew Murphy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490630#comment-17490630
 ] 

Andrew Murphy commented on SPARK-37954:
---

 Hi all,

Super excited to report on this one; it is my first time attempting to 
contribute to Spark, so bear with me!

It looks like df.drop("id") should not raise an error, per the 
[documentation|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.drop.html]:
 "This is a no-op if schema doesn’t contain the given column name(s)." So this 
is intended behavior.

I was sure the Filter behavior was a bug, though. Normally, filtering on a 
nonexistent column _does_ throw an AnalysisException, but in one very specific 
case Catalyst actually propagates the reference that was supposed to be removed.

*Code*

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col as col
spark = SparkSession.builder.appName('available_columns').getOrCreate()
df = spark.createDataFrame([{"oldcol": 1}, {"oldcol": 2}])
df = df.withColumnRenamed('oldcol', 'newcol')
df = df.filter(col("oldcol")!=2)
df.explain("extended") {code}
*Output*

 

 
{code:java}
+------+
|newcol|
+------+
|     1|
+------+

== Parsed Logical Plan ==
'Filter NOT ('oldcol = 2)
+- Project [oldcol#6L AS newcol#8L]
   +- LogicalRDD [oldcol#6L], false

== Analyzed Logical Plan ==
newcol: bigint
Project [newcol#8L]
+- Filter NOT (oldcol#6L = cast(2 as bigint))
   +- Project [oldcol#6L AS newcol#8L, oldcol#6L]
  +- LogicalRDD [oldcol#6L], false

== Optimized Logical Plan ==
Project [oldcol#6L AS newcol#8L]
+- Filter (isnotnull(oldcol#6L) AND NOT (oldcol#6L = 2))
   +- LogicalRDD [oldcol#6L], false

== Physical Plan ==
*(1) Project [oldcol#6L AS newcol#8L]
+- *(1) Filter (isnotnull(oldcol#6L) AND NOT (oldcol#6L = 2))
   +- *(1) Scan ExistingRDD[oldcol#6L] {code}
As you can see, in the Analysis step Catalyst propagates oldcol#6L to the 
Project operator for seemingly no reason. It turns out the reason comes from 
SPARK-24781 ([PR|https://github.com/apache/spark/pull/21745]), where analysis 
was failing on this construction.
{code:scala}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
{code}
{noformat}
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from 
name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
   +- AnalysisBarrier
  +- Project [name#5]
 +- Project [_1#2 AS name#5, _2#3 AS id#6]
+- LocalRelation [_1#2, _2#3]{noformat}
So it looks like this PR was created to propagate references in the very 
specific case of a Filter after Project when we filter and select in the same 
step, but unfortunately we don't differentiate between
{code:java}
df.select(df("name")).filter(df("id") === 0).show()  {code}
and 
{code:java}
df = df.select(df("name"))
df.filter(df("id") === 0).show()  {code}
[~viirya] I see you were the one who fixed this problem in the first place. Any 
thoughts about how this could be solved? Unfortunately, creating a fix for this 
one is beyond my capabilities right now but I'm trying to learn!
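
This is not a fix for the analyzer behaviour discussed above, but as a hedged 
defensive pattern in the Scala API (the helper name is illustrative), one can 
check the schema explicitly before filtering so that a renamed column fails fast 
instead of silently resolving to a stale reference:

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Guard the filter with an explicit schema check.
def safeFilter(df: DataFrame, columnName: String, predicate: Column => Column): DataFrame = {
  require(df.columns.contains(columnName),
    s"Column '$columnName' is not in [${df.columns.mkString(", ")}]")
  df.filter(predicate(col(columnName)))
}

// safeFilter(df, "oldcol", _ =!= 2) throws after withColumnRenamed("oldcol", "newcol"),
// rather than quietly filtering on the propagated old reference.
{code}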

 

> old columns should not be available after select or drop
> 
>
> Key: SPARK-37954
> URL: https://issues.apache.org/jira/browse/SPARK-37954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.1
>Reporter: Jean Bon
>Priority: Major
>
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col as col
> spark = SparkSession.builder.appName('available_columns').getOrCreate()
> df = spark.range(5).select((col("id")+10).alias("id2"))
> assert df.columns==["id2"] #OK
> try:
>     df.select("id")
>     error_raise = False
> except:
>     error_raise = True
> assert error_raise #OK
> df = df.drop("id") #should raise an error
> df.filter(col("id")!=2).count() #returns 4, should raise an error
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers

2021-11-11 Thread Andrew Davidson (Jira)
Andrew Davidson created SPARK-37295:
---

 Summary: illegal reflective access operation has occurred; Please 
consider reporting this to the maintainers
 Key: SPARK-37295
 URL: https://issues.apache.org/jira/browse/SPARK-37295
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
 Environment: MacBook pro running mac OS 11.6

spark-3.1.2-bin-hadoop3.2

It is not clear to me how Spark finds Java. I believe I also have Java 8 
installed somewhere.

```

$ which java
~/anaconda3/envs/extraCellularRNA/bin/java

$ java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)

```

 
Reporter: Andrew Davidson


```

   spark = SparkSession\
                .builder\
                .appName("TestEstimatedScalingFactors")\
                .getOrCreate()

```

generates the following warning

```

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar)
 to constructor java.nio.DirectByteBuffer(long,int)

WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform

WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations

WARNING: All illegal access operations will be denied in a future release

21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable

```

I am using pyspark spark-3.1.2-bin-hadoop3.2 on a MacBook pro running mac OS 
11.6

 

My small unit test seems to work okay; however, it fails when I try to run on 
3.2.0.

 

Any idea how I can track down this issue? Kind regards,

 

Andy

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36649) Support Trigger.AvailableNow on Kafka data source

2021-09-07 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411256#comment-17411256
 ] 

Andrew Olson commented on SPARK-36649:
--

[~kabhwan] I am not sure I fully understand the details of this.

I only have experience with using a Kafka source in batch mode 
([https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries])
 and minimal familiarity with most of the Spark code.

> Support Trigger.AvailableNow on Kafka data source
> -
>
> Key: SPARK-36649
> URL: https://issues.apache.org/jira/browse/SPARK-36649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-36533 introduces a new trigger Trigger.AvailableNow, but only 
> introduces the new functionality to the file stream source. Given that Kafka 
> data source is the one of major data sources being used in streaming query, 
> we should make Kafka data source support this.
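
For readers following along, a hedged sketch of what the requested usage would 
look like once the Kafka source accepts this trigger (the Trigger.AvailableNow 
API exists as of Spark 3.3; the broker, topic, and paths below are placeholders, 
and Kafka support is exactly what this ticket asks for):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("available-now-kafka").getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker address
  .option("subscribe", "events")                    // placeholder topic
  .load()

val query = kafkaStream.writeStream
  .format("parquet")
  .option("path", "/tmp/events")                    // placeholder sink path
  .option("checkpointLocation", "/tmp/events-ckpt") // placeholder checkpoint path
  .trigger(Trigger.AvailableNow())                  // process everything available, then stop
  .start()

query.awaitTermination()
{code}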



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36576) Improve range split calculation for Kafka Source minPartitions option

2021-08-24 Thread Andrew Olson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Olson updated SPARK-36576:
-
Description: 
While the 
[documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
 does contain a clear disclaimer,
{quote}Please note that this configuration is like a {{hint}}: the number of 
Spark tasks will be *approximately* {{minPartitions}}. It can be less or more 
depending on rounding errors or Kafka partitions that didn't receive any new 
data.
{quote}
there are cases where the calculated Kafka partition range splits can differ 
greatly from expectations. For evenly distributed data and most 
{{minPartitions}} values this would not be a major or commonly encountered 
concern. However when the distribution of data across partitions is very 
heavily skewed, somewhat surprising range split calculations can result.

For example, given the following input data:
 * 1 partition containing 10,000 messages
 * 1,000 partitions each containing 1 message

Spark processing code loading from this collection of 1,001 partitions may 
decide that it would like each task to read no more than 1,000 messages. 
Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting 
the single large partition to be split into 10 equal chunks, along with the 
1,000 small partitions each having their own task. That is far from what 
actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting 
the large partition into 918 chunks of 10 or 11 messages, two orders of 
magnitude from the desired maximum message count per task and nearly double the 
number of Spark tasks hinted in the configuration.

Proposing that the {{KafkaOffsetRangeCalculator}}'s range calculation logic be 
modified to exclude small (i.e. un-split) partitions from the overall 
proportional distribution math, in order to more reasonably divide the large 
partitions when they are accompanied by many small partitions, and to provide 
optimal behavior for cases where a {{minPartitions}} value is deliberately 
computed based on the volume of data being read.

  was:
While the 
[documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
 does contain a clear disclaimer,
{quote}Please note that this configuration is like a {{hint}}: the number of 
Spark tasks will be *approximately* {{minPartitions}}. It can be less or more 
depending on rounding errors or Kafka partitions that didn't receive any new 
data.
{quote}
there are cases where the calculated Kafka partition range splits can differ 
greatly from expectations. For evenly distributed data and most 
{{minPartitions}} values this would not be a major or commonly encountered 
concern. However when the distribution of data across partitions is very 
heavily skewed, somewhat surprising range split calculations can result.

For example, given the following input data:
 * 1 partition containing 10,000 messages
 * 1,000 partitions each containing 1 message

Spark processing code loading from this collection of 1,001 partitions may 
decide that it would like each task to read no more than 1,000 messages. 
Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting 
the single large partition to be split into 10 equal chunks, along with the 
1,000 small partitions each having their own task. That is far from what 
actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting 
the large partition into 918 chunks of 10 or 11 messages, two orders of 
magnitude from the desired maximum message count per task and nearly double the 
number of Spark tasks hinted in the configuration.

Proposing that the range calculation logic be modified to exclude small (i.e. 
un-split) partitions from the overall proportional distribution math, in order 
to more reasonably divide the large partitions when they are accompanied by 
many small partitions, and to provide optimal behavior for cases where a 
{{minPartitions}} value is deliberately computed based on the volume of data 
being read.


> Improve range split calculation for Kafka Source minPartitions option
> -
>
> Key: SPARK-36576
> URL: https://issues.apache.org/jira/browse/SPARK-36576
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Andrew Olson
>Priority: Minor
>
> While the 
> [documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
>  does contain a clear disclaimer,
> {quote}Please note that this configuration is like a {{hint}}: the number of 
> Spark tasks will be *approximately* {{minPartitions}}. It can be less or more 
> depending on rounding errors or Kafka partitions 

[jira] [Created] (SPARK-36576) Improve range split calculation for Kafka Source minPartitions option

2021-08-24 Thread Andrew Olson (Jira)
Andrew Olson created SPARK-36576:


 Summary: Improve range split calculation for Kafka Source 
minPartitions option
 Key: SPARK-36576
 URL: https://issues.apache.org/jira/browse/SPARK-36576
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.2
Reporter: Andrew Olson


While the 
[documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
 does contain a clear disclaimer,
{quote}Please note that this configuration is like a {{hint}}: the number of 
Spark tasks will be *approximately* {{minPartitions}}. It can be less or more 
depending on rounding errors or Kafka partitions that didn't receive any new 
data.
{quote}
there are cases where the calculated Kafka partition range splits can differ 
greatly from expectations. For evenly distributed data and most 
{{minPartitions}} values this would not be a major or commonly encountered 
concern. However when the distribution of data across partitions is very 
heavily skewed, somewhat surprising range split calculations can result.

For example, given the following input data:
 * 1 partition containing 10,000 messages
 * 1,000 partitions each containing 1 message

Spark processing code loading from this collection of 1,001 partitions may 
decide that it would like each task to read no more than 1,000 messages. 
Consequently, it could specify a {{minPartitions}} value of 1,010 - expecting 
the single large partition to be split into 10 equal chunks, along with the 
1,000 small partitions each having their own task. That is far from what 
actually occurs. The {{KafkaOffsetRangeCalculator}} algorithm ends up splitting 
the large partition into 918 chunks of 10 or 11 messages, two orders of 
magnitude from the desired maximum message count per task and nearly double the 
number of Spark tasks hinted in the configuration.

Proposing that the range calculation logic be modified to exclude small (i.e. 
un-split) partitions from the overall proportional distribution math, in order 
to more reasonably divide the large partitions when they are accompanied by 
many small partitions, and to provide optimal behavior for cases where a 
{{minPartitions}} value is deliberately computed based on the volume of data 
being read.
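
As a rough illustration of the proposal, here is a hedged sketch (hypothetical 
helper names, not Spark's actual KafkaOffsetRangeCalculator) in which partitions 
too small to split keep one task each and only the remaining task budget is 
divided proportionally among the large partitions; with the example input above, 
the large partition is split into 10 chunks of roughly 1,000 messages:

{code:scala}
// Simplified model of the proposed allocation; not the real calculator.
final case class PartitionSize(name: String, messages: Long)

object MinPartitionsSketch {
  def plannedSplits(parts: Seq[PartitionSize], minPartitions: Int): Map[String, Int] = {
    val avg = parts.map(_.messages).sum.toDouble / minPartitions
    val (large, small) = parts.partition(_.messages > avg)  // small partitions stay whole
    val remainingBudget = math.max(minPartitions - small.size, large.size)
    val largeTotal = large.map(_.messages).sum.toDouble
    val largeSplits = large.map { p =>
      p.name -> math.max(1, math.round(p.messages / largeTotal * remainingBudget).toInt)
    }
    (largeSplits ++ small.map(p => p.name -> 1)).toMap
  }

  def main(args: Array[String]): Unit = {
    val parts = PartitionSize("big", 10000L) +: (1 to 1000).map(i => PartitionSize(s"small-$i", 1L))
    val splits = plannedSplits(parts, minPartitions = 1010)
    println(s"big partition split into ${splits("big")} tasks")  // 10 tasks of ~1,000 messages each
    println(s"total tasks: ${splits.values.sum}")                // 1,010
  }
}
{code}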



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"

2021-08-23 Thread Andrew Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403341#comment-17403341
 ] 

Andrew Grigorev commented on SPARK-29223:
-

Just an idea: couldn't this be implemented in a "timestamp field filter 
pushdown"-like way?

> Kafka source: offset by timestamp - allow specifying timestamp for "all 
> partitions"
> ---
>
> Key: SPARK-29223
> URL: https://issues.apache.org/jira/browse/SPARK-29223
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.2.0
>
>
> This issue is a follow-up of SPARK-26848.
> In SPARK-26848, we decided to open possibility to let end users set 
> individual timestamp per partition. But in many cases, specifying timestamp 
> represents the intention that we would want to go back to specific timestamp 
> and reprocess records, which should be applied to all topics and partitions.
> According to the format of 
> `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not 
> intuitive to provide an option to set a global timestamp across topic, it's 
> still intuitive to provide an option to set a global timestamp across 
> partitions in a topic.
> This issue tracks the efforts to deal with this.
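
For context, a hedged sketch of the existing per-partition form that this ticket 
wants to generalize (startingOffsetsByTimestamp is a documented Kafka source 
option; the broker, topic, partition ids, and timestamps below are placeholders): 
every partition has to be listed with its own timestamp, even when the intent is 
a single global timestamp.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("offsets-by-timestamp").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  // The same timestamp has to be repeated for every partition of the topic.
  .option("startingOffsetsByTimestamp",
    """{"events": {"0": 1635724800000, "1": 1635724800000}}""")
  .load()
{code}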



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35919) Support pathlib.PurePath-like objects in DataFrameReader / DataFrameWriter

2021-06-28 Thread Andrew Grigorev (Jira)
Andrew Grigorev created SPARK-35919:
---

 Summary: Support pathlib.PurePath-like objects in DataFrameReader 
/ DataFrameWriter
 Key: SPARK-35919
 URL: https://issues.apache.org/jira/browse/SPARK-35919
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.2, 2.4.8
Reporter: Andrew Grigorev


It would be nice to support Path objects in 
`spark.\{read,write}.\{parquet,orc,csv,...etc}` methods.

Without pyspark source code changes, this currently seems possible only via 
ugly monkeypatching hacks: https://stackoverflow.com/q/68170685/2649222.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34723) Correct parameter type for subexpression elimination under whole-stage

2021-06-24 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368894#comment-17368894
 ] 

Andrew Olson commented on SPARK-34723:
--

In case it might be helpful to anyone, the compilation failure's stack trace 
looks something like this.

{noformat}
[main] ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 32, Column 84: IDENTIFIER expected instead of '['
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, 
Column 84: IDENTIFIER expected instead of '['
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:196)
at org.codehaus.janino.Parser.read(Parser.java:3705)
at org.codehaus.janino.Parser.parseQualifiedIdentifier(Parser.java:446)
at org.codehaus.janino.Parser.parseReferenceType(Parser.java:2569)
at org.codehaus.janino.Parser.parseType(Parser.java:2549)
at org.codehaus.janino.Parser.parseFormalParameter(Parser.java:1688)
at org.codehaus.janino.Parser.parseFormalParameterList(Parser.java:1639)
at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1620)
at 
org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518)
at 
org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028)
at org.codehaus.janino.Parser.parseClassBody(Parser.java:841)
at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736)
at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1403)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1500)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:92)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:112)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:184)
at 

[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2019-12-09 Thread Andrew White (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991752#comment-16991752
 ] 

Andrew White commented on SPARK-22876:
--

This issue is most certainly unresolved, and the documentation is misleading at 
best.

[~choojoyq] - perhaps you could share your code on how you deal with this in 
general? The general solution seems non-obvious to me given all of the 
intricacies of Spark + YARN interactions, such as shutdown hooks, signal 
handling, and client vs. cluster mode. Maybe I'm overthinking the solution.

This feature is critical for supporting long-running applications.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> for example, following setup will allow the application master to fail 20 
> times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found in source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there's a ShutdownHook that checks the attempt id against the maxAppAttempts, 
> if attempt id >= maxAppAttempts, it will try to unregister the application 
> and the application will finish.
> Is this the expected design or a bug?
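
For reference, a hedged sketch of how the Spark-side half of the configuration 
above can be expressed on a SparkConf (the values mirror the example in the 
description; the YARN-side yarn.resourcemanager.am.max-attempts still has to be 
raised on the ResourceManager itself):

{code:scala}
import org.apache.spark.SparkConf

// Spark-side settings from the example above; the effective limit is still
// capped by yarn.resourcemanager.am.max-attempts on the cluster.
val conf = new SparkConf()
  .set("spark.yarn.maxAppAttempts", "20")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1s")
{code}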



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-08-08 Thread Andrew Leverentz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903377#comment-16903377
 ] 

Andrew Leverentz commented on SPARK-28085:
--

In Chrome 76, this issue appears to be resolved.  Thanks to anyone out there 
who submitted bug reports :)

> Spark Scala API documentation URLs not working properly in Chrome
> -
>
> Key: SPARK-28085
> URL: https://issues.apache.org/jira/browse/SPARK-28085
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Andrew Leverentz
>Priority: Minor
>
> In Chrome version 75, URLs in the Scala API documentation are not working 
> properly, which makes them difficult to bookmark.
> For example, URLs like the following get redirected to a generic "root" 
> package page:
> [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]
> Here's the URL that I get redirected to:
> [https://spark.apache.org/docs/latest/api/scala/index.html#package]
> This issue seems to have appeared between versions 74 and 75 of Chrome, but 
> the documentation URLs still work in Safari.  I suspect that this has 
> something to do with security-related changes to how Chrome 75 handles frames 
> and/or redirects.  I've reported this issue to the Chrome team via the 
> in-browser help menu, but I don't have any visibility into their response, so 
> it's not clear whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-07-22 Thread Andrew Leverentz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890314#comment-16890314
 ] 

Andrew Leverentz commented on SPARK-28085:
--

This issue still remains, more than a month after the Chrome update that caused 
it.  It's not clear whether Google considers it a bug that needs fixing.  I've 
reported the issue to Google, as mentioned above, but if anyone else has a 
better way of contacting the Chrome team, I'd appreciate it if you could try to 
get in touch with them to see whether they are aware of this bug and planning 
to fix it.

> Spark Scala API documentation URLs not working properly in Chrome
> -
>
> Key: SPARK-28085
> URL: https://issues.apache.org/jira/browse/SPARK-28085
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Andrew Leverentz
>Priority: Minor
>
> In Chrome version 75, URLs in the Scala API documentation are not working 
> properly, which makes them difficult to bookmark.
> For example, URLs like the following get redirected to a generic "root" 
> package page:
> [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]
> Here's the URL that I get redirected to:
> [https://spark.apache.org/docs/latest/api/scala/index.html#package]
> This issue seems to have appeared between versions 74 and 75 of Chrome, but 
> the documentation URLs still work in Safari.  I suspect that this has 
> something to do with security-related changes to how Chrome 75 handles frames 
> and/or redirects.  I've reported this issue to the Chrome team via the 
> in-browser help menu, but I don't have any visibility into their response, so 
> it's not clear whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28225) Unexpected behavior for Window functions

2019-07-22 Thread Andrew Leverentz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Leverentz resolved SPARK-28225.
--
Resolution: Not A Problem

> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions

2019-07-22 Thread Andrew Leverentz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890310#comment-16890310
 ] 

Andrew Leverentz commented on SPARK-28225:
--

Marco, thanks for the explanation.  In this case, the workaround in Scala is to 
use

{{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}

This issue can be marked resolved.
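
For anyone hitting the same surprise, a hedged, self-contained sketch of that 
workaround (assumes a local SparkSession; the data mirrors the issue 
description): the default frame of an ordered Window is (unboundedPreceding, 
currentRow), so early rows only "see" nulls, and widening the frame makes the 
result constant.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.appName("window-frame-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Same shape of data as in the issue description: y is null for the small x values.
val input = Seq(
  (101, None, Some(11)), (102, None, Some(12)), (103, None, Some(13)),
  (203, Some(24), None), (201, Some(26), None), (202, Some(25), None)
).toDF("x", "y", "z")

// Explicit full frame instead of the default (unboundedPreceding, currentRow).
val fullFrame = Window.orderBy($"x".asc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

input.withColumn("u2_fixed", first($"y", ignoreNulls = true).over(fullFrame)).show()
// u2_fixed is the constant 26 on every row
{code}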

> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions

2019-07-22 Thread Andrew Leverentz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890310#comment-16890310
 ] 

Andrew Leverentz edited comment on SPARK-28225 at 7/22/19 4:54 PM:
---

Marco, thanks for the explanation.  In this case, the solution in Scala is to 
use

{{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}

This issue can be marked resolved.


was (Author: alev_etx):
Marco, thanks for the explanation.  In this case, the workaround in Scala is to 
use

{{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}

This issue can be marked resolved.

> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28225) Unexpected behavior for Window functions

2019-07-01 Thread Andrew Leverentz (JIRA)
Andrew Leverentz created SPARK-28225:


 Summary: Unexpected behavior for Window functions
 Key: SPARK-28225
 URL: https://issues.apache.org/jira/browse/SPARK-28225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Andrew Leverentz


I've noticed some odd behavior when combining the "first" aggregate function 
with an ordered Window.

In particular, I'm working with columns created using the syntax
{code}
first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
{code}
Below, I'm including some code which reproduces this issue in a Databricks 
notebook.

*Code:*
{code:java}
import org.apache.spark.sql.functions.first
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,IntegerType}

val schema = StructType(Seq(
  StructField("x", IntegerType, false),
  StructField("y", IntegerType, true),
  StructField("z", IntegerType, true)
))

val input =
  spark.createDataFrame(sc.parallelize(Seq(
Row(101, null, 11),
Row(102, null, 12),
Row(103, null, 13),
Row(203, 24, null),
Row(201, 26, null),
Row(202, 25, null)
  )), schema = schema)

input.show

val output = input
  .withColumn("u1", first($"y", ignoreNulls = 
true).over(Window.orderBy($"x".asc_nulls_last)))
  .withColumn("u2", first($"y", ignoreNulls = 
true).over(Window.orderBy($"x".asc)))
  .withColumn("u3", first($"y", ignoreNulls = 
true).over(Window.orderBy($"x".desc_nulls_last)))
  .withColumn("u4", first($"y", ignoreNulls = 
true).over(Window.orderBy($"x".desc)))
  .withColumn("u5", first($"z", ignoreNulls = 
true).over(Window.orderBy($"x".asc_nulls_last)))
  .withColumn("u6", first($"z", ignoreNulls = 
true).over(Window.orderBy($"x".asc)))
  .withColumn("u7", first($"z", ignoreNulls = 
true).over(Window.orderBy($"x".desc_nulls_last)))
  .withColumn("u8", first($"z", ignoreNulls = 
true).over(Window.orderBy($"x".desc)))

output.show
{code}
*Expectation:*

Based on my understanding of how ordered-Window and aggregate functions work, 
the results I expected to see were:
 * u1 = u2 = constant value of 26
 * u3 = u4 = constant value of 24
 * u5 = u6 = constant value of 11
 * u7 = u8 = constant value of 13

However, columns u1, u2, u7, and u8 contain some unexpected nulls. 

*Results:*
{code:java}
+---+----+----+----+----+---+---+---+---+----+----+
|  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
+---+----+----+----+----+---+---+---+---+----+----+
|203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
|202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
|201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
|103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
|102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
|101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
+---+----+----+----+----+---+---+---+---+----+----+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-06-18 Thread Andrew Leverentz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Leverentz updated SPARK-28085:
-
Description: 
In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get redirected to:

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".

  was:
In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get :

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".


> Spark Scala API documentation URLs not working properly in Chrome
> -
>
> Key: SPARK-28085
> URL: https://issues.apache.org/jira/browse/SPARK-28085
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Andrew Leverentz
>Priority: Minor
>
> In Chrome version 75, URLs in the Scala API documentation are not working 
> properly, which makes them difficult to bookmark.
> For example, URLs like the following get redirected to a generic "root" 
> package page:
> [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]
> Here's the URL that I get redirected to:
> [https://spark.apache.org/docs/latest/api/scala/index.html#package]
> This issue seems to have appeared between versions 74 and 75 of Chrome, but 
> the documentation URLs still work in Safari.  I suspect that this has 
> something to do with security-related changes to how Chrome 75 handles frames 
> and/or redirects.  I've reported this issue to the Chrome team via the 
> in-browser help menu, but I don't have any visibility into their response, so 
> it's not clear whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-06-17 Thread Andrew Leverentz (JIRA)
Andrew Leverentz created SPARK-28085:


 Summary: Spark Scala API documentation URLs not working properly 
in Chrome
 Key: SPARK-28085
 URL: https://issues.apache.org/jira/browse/SPARK-28085
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.3
Reporter: Andrew Leverentz


In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get redirected to:

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28062) HuberAggregator copies coefficients vector every time an instance is added

2019-06-15 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-28062:
-

 Summary: HuberAggregator copies coefficients vector every time an 
instance is added
 Key: SPARK-28062
 URL: https://issues.apache.org/jira/browse/SPARK-28062
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0
Reporter: Andrew Crosby


Every time an instance is added to the HuberAggregator, a copy of the 
coefficients vector is created (see code snippet below). This causes a 
performance degradation, which is particularly severe when the instances have 
long sparse feature vectors.

{code:scala}
def add(instance: Instance): HuberAggregator = {
  instance match { case Instance(label, weight, features) =>
    require(numFeatures == features.size, s"Dimensions mismatch when adding new sample." +
      s" Expecting $numFeatures but got ${features.size}.")
    require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")

    if (weight == 0.0) return this
    val localFeaturesStd = bcFeaturesStd.value
    val localCoefficients = bcParameters.value.toArray.slice(0, numFeatures)
    val localGradientSumArray = gradientSumArray

    // Snip
  }
}
{code}

The LeastSquaresAggregator class avoids this performance issue via the use of 
transient lazy class variables to store such reused values. Applying a similar 
approach to HuberAggregator gives a significant speed boost. Running the script 
below locally on my machine gives the following timing results:

{noformat}
Current implementation: 
Time(s): 540.1439919471741
Iterations: 26
Intercept: 0.518109382890512
Coefficients: [0.0, -0.2516936902000245, 0.0, 0.0, -0.19633887469839809, 
0.0, -0.39565545053893925, 0.0, -0.18617574426698882, 0.0478922416670529]

Modified implementation to match LeastSquaresAggregator:
Time(s): 46.82946586608887
Iterations: 26
Intercept: 0.5181093828893774
Coefficients: [0.0, -0.25169369020031357, 0.0, 0.0, -0.1963388746927919, 
0.0, -0.3956554505389966, 0.0, -0.18617574426702874, 0.04789224166878518]
{noformat}




{code:python}
from random import random, randint, seed
import time

from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

seed(0)

spark = SparkSession.builder.appName('huber-speed-test').getOrCreate()
df = spark.createDataFrame([[randint(0, 10), random()] for i in 
range(10)],  ["category", "target"])
ohe = OneHotEncoder(inputCols=["category"], 
outputCols=["encoded_category"]).fit(df)
lr = LinearRegression(featuresCol="encoded_category", labelCol="target", 
loss="huber", regParam=1.0)

start = time.time()
model = lr.fit(ohe.transform(df))
end = time.time()

print("Time(s): " + str(end - start))
print("Iterations: " + str(model.summary.totalIterations))
print("Intercept: " + str(model.intercept))
print("Coefficients: " + str(list(model.coefficients)[0:10]))
{code}
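
For illustration, a hedged sketch of the "transient lazy" pattern referred to 
above (simplified stand-ins, not Spark's real HuberAggregator or 
LeastSquaresAggregator): the broadcast value is materialized once per 
deserialized aggregator instead of being copied on every add() call.

{code:scala}
import org.apache.spark.broadcast.Broadcast

// Simplified aggregator; the loss below is a placeholder, not the real Huber loss.
class HuberAggregatorSketch(bcCoefficients: Broadcast[Array[Double]]) extends Serializable {

  // Materialized lazily on first use and excluded from serialization,
  // so add() no longer re-extracts the coefficients for every instance.
  @transient private lazy val localCoefficients: Array[Double] = bcCoefficients.value

  private var lossSum = 0.0

  def add(features: Array[Double], label: Double): this.type = {
    val margin = features.zip(localCoefficients).map { case (f, c) => f * c }.sum
    lossSum += math.abs(label - margin)
    this
  }
}
{code}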





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded

2019-04-24 Thread Andrew McHarg (JIRA)
Andrew McHarg created SPARK-27560:
-

 Summary: HashPartitioner uses Object.hashCode which is not seeded
 Key: SPARK-27560
 URL: https://issues.apache.org/jira/browse/SPARK-27560
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.0
 Environment: Notebook is running spark v2.4.0 local[*]

Python 3.6.6 (default, Sep  6 2018, 13:10:03)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin

I imagine this would reproduce on all operating systems and most versions of 
spark though.
Reporter: Andrew McHarg


Forgive the quality of the bug report here; I am a pyspark user and not super 
familiar with the internals of Spark, yet I seem to have hit a strange corner 
case with the HashPartitioner.

This may already be known, but repartition with HashPartitioner seems to assign 
everything to the same partition if data that was partitioned by the same column 
is only partially read (say, one partition). I suppose it is an obvious 
consequence of Object.hashCode being deterministic, but it took a while to track 
down.

Steps to repro:
 # Get dataframe with a bunch of uuids say 1
 # repartition(100, 'uuid_column')
 # save to parquet
 # read from parquet
 # collect()[:100] then filter using pyspark.sql.functions isin (yes I know 
this is bad and sampleBy should probably be used here)
 # repartition(10, 'uuid_column')
 # Resulting dataframe will have all of its data in one single partition

Jupyter notebook for the above: 
https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e

I think an easy fix would be to seed the HashPartitioner, as many hashtable 
libraries do to avoid denial-of-service attacks. It might also be that this is 
obvious behavior to more experienced Spark users :)
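
As a hedged sketch of that seeding idea (a hypothetical class, not part of 
Spark), a custom Partitioner can mix a per-job seed into the key hash before 
taking the modulo, so keys that all landed in one upstream partition spread out 
again downstream:

{code:scala}
import org.apache.spark.Partitioner
import scala.util.hashing.MurmurHash3

class SeededHashPartitioner(override val numPartitions: Int, seed: Int) extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  override def getPartition(key: Any): Int = {
    val h = MurmurHash3.stringHash(String.valueOf(key), seed)
    val mod = h % numPartitions
    if (mod < 0) mod + numPartitions else mod  // keep the result non-negative
  }
}

// Usage (assumes an existing RDD of key/value pairs named `pairs`):
// pairs.partitionBy(new SeededHashPartitioner(numPartitions = 10, seed = 12345))
{code}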



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-04-20 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822548#comment-16822548
 ] 

Andrew Crosby commented on SPARK-26970:
---

The code changes required for this looked relatively straightforward so I've 
had a go at creating a pull request myself 
(https://github.com/apache/spark/pull/24426)

[~huaxingao] apologies if I've duplicated work that you've already done.

> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Minor
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27216) Kryo serialization with RoaringBitmap

2019-03-28 Thread Andrew Allen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804447#comment-16804447
 ] 

Andrew Allen commented on SPARK-27216:
--

Thanks for this bug report, [~cltlfcjin]. We were running into 
"FetchFailedException: Received a zero-size buffer for block shuffle_0_0_332 
from BlockManagerId", and this bug report gave us enough of a hint to try 
{{config.set("spark.kryo.unsafe", "false")}}, which worked around the issue.
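
In case it helps anyone else, a minimal sketch of that workaround as a SparkConf 
(only the two relevant settings shown):

{code:java}
import org.apache.spark.SparkConf

// Keep Kryo as the serializer but disable the unsafe variant, which is the
// code path that mis-handles the RoaringBitmap inside HighlyCompressedMapStatus.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.unsafe", "false")
{code}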

> Kryo serialization with RoaringBitmap
> -
>
> Key: SPARK-27216
> URL: https://issues.apache.org/jira/browse/SPARK-27216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But 
> RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer.
> We can use below UT to reproduce:
> {code}
>   test("kryo serialization with RoaringBitmap") {
> val bitmap = new RoaringBitmap
> bitmap.add(1787)
> val safeSer = new KryoSerializer(conf).newInstance()
> val bitmap2 : RoaringBitmap = 
> safeSer.deserialize(safeSer.serialize(bitmap))
> assert(bitmap2.equals(bitmap))
> conf.set("spark.kryo.unsafe", "true")
> val unsafeSer = new KryoSerializer(conf).newInstance()
> val bitmap3 : RoaringBitmap = 
> unsafeSer.deserialize(unsafeSer.serialize(bitmap))
> assert(bitmap3.equals(bitmap)) // this will fail
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer ( 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 ) is missing from the set of pyspark feature transformers ( 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 ).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer ( 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 ) is missing from the set of pyspark feature transformers ( 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 ).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)].

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer ( 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
>  ) is missing from the set of pyspark feature transformers ( 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
>  ).
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-26970:
-

 Summary: Can't load PipelineModel that was created in Scala with 
Python due to missing Interaction transformer
 Key: SPARK-26970
 URL: https://issues.apache.org/jira/browse/SPARK-26970
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Andrew Crosby


The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)].

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev commented on SPARK-22865:
-

I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images; you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/], and the scripts were 
also updated. I would love to help make this happen for Spark, but I need 
somebody to show me around the Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 11:00 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images; you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/], and the scripts were 
also updated. I would love to help make this happen for Spark, but I need 
somebody to show me around the Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 11:00 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images; you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/], and the scripts were 
also updated. I would love to help make this happen for Spark, but I need 
somebody to show me around the Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 10:59 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images; you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/], and the scripts were 
also updated. I would love to help make this happen for Spark, but I need 
somebody to show me around the Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also 
updated. I would love to help making this happen for Spark, but I need somebody 
to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Andrew Otto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691787#comment-16691787
 ] 

Andrew Otto commented on SPARK-14492:
-

I found my issue: we were loading some Hive 1.1.0 jars manually using 
spark.driver.extraClassPath in order to open a Hive JDBC connection directly to 
Hive (instead of using spark.sql()) as a workaround for 
https://issues.apache.org/jira/browse/SPARK-23890.  The Hive 1.1.0 classes were 
loaded before the ones bundled with Spark, and as such they failed when 
referencing a Hive configuration field that didn't exist in 1.1.0.

 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-13 Thread Andrew Otto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685375#comment-16685375
 ] 

Andrew Otto commented on SPARK-14492:
-

FWIW, I am experiencing the same problem with Spark 2.3.1 and Hive 1.1.0 (from 
CDH 5.15.0).  I've tried setting both spark.sql.hive.metastore.version and 
spark.sql.hive.metastore.jars (although I'm not sure I've got the right 
classpath for the latter), and am still hitting the error below.
{code:java}
18/11/13 14:31:12 ERROR ApplicationMaster: User class threw exception: 
java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
 at 
org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:195)
 at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
 at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
 at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
 at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
 at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
 at 
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
 at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
 at org.apache.spark.sql.Dataset.(Dataset.scala:172)
 at org.apache.spark.sql.Dataset.(Dataset.scala:178)
 at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
 at 
org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:488){code}
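
For reference, here is a sketch of how those two settings are usually supplied; 
the jar path is a hypothetical CDH-style location, and both options have to be 
in place before the SparkSession (and hence the HiveExternalCatalog) is created:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.1.0")
  // Must point at a complete Hive 1.1.0 client classpath (hypothetical path):
  .config("spark.sql.hive.metastore.jars", "/opt/cloudera/parcels/CDH/lib/hive/lib/*")
  .getOrCreate()
{code}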

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0 so its not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 

[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634268#comment-16634268
 ] 

Andrew Crosby commented on SPARK-25544:
---

SPARK-23537 contains what might be another occurrence of this issue. The model 
in that case contains only binary features, so standardization shouldn't really 
be used. However, turning standardization off causes the model to take 4992 
iterations to converge as opposed to 37 iterations when standardization is 
turned on.

> Slow/failed convergence in Spark ML models due to internal predictor scaling
> 
>
> Key: SPARK-25544
> URL: https://issues.apache.org/jira/browse/SPARK-25544
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.2
> Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>Reporter: Andrew Crosby
>Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a 
> large number of iterations to converge, or fail to converge altogether, when 
> trained using the l-bfgs method with standardization turned off.
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by 
> default. In SPARK-8522 the option to disable standardization was added. This 
> is implemented internally by changing the effective strength of 
> regularization rather than disabling the feature scaling. Mathematically, 
> both changing the effective regularization strength and disabling feature 
> scaling should give the same solution, but they can have very different 
> convergence properties.
> The normal justification given for scaling features is that it ensures that 
> all feature variances are O(1) and should improve numerical convergence, but 
> this argument does not account for the regularization term. This doesn't 
> cause any issues if standardization is set to true, since all features will 
> have an O(1) regularization strength. But it does cause issues when 
> standardization is set to false, since the effective regularization strength 
> of feature i is now O(1/sigma_i^2), where sigma_i is the standard deviation 
> of the feature. This means that predictors with small standard deviations 
> (which can occur legitimately, e.g. via one-hot encoding) will have very 
> large effective regularization strengths and consequently lead to very large 
> gradients and thus poor convergence in the solver.
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very 
> simple test case which builds a linear regression model with a categorical 
> feature, a numerical feature and their interaction. When fed the specified 
> training data, this model will fail to converge before it hits the maximum 
> iteration limit. In this case, it is the interaction between category "2" and 
> the numeric feature that leads to a feature with a small standard deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
>  
> {code:java}
> val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
> 2.0)).toDF("category", "numericFeature", "label")
> val indexer = new StringIndexer().setInputCol("category") 
> .setOutputCol("categoryIndex")
> val encoder = new 
> OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
> val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
> "numericFeature")).setOutputCol("interaction")
> val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
> "interaction")).setOutputCol("features")
> val model = new 
> LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
> val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
> assembler, model))
> val pipelineModel  = pipeline.fit(df)
> val numIterations = 
> pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
>  *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when 
> standardization is set to false rather than using an effective regularization 
> strength. This can be hacked into LinearRegression.scala by simply replacing 
> line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization)) 
> featuresSummarizer.variance.toArray.map(math.sqrt) else 
> featuresSummarizer.variance.toArray.map(x => 1.0)
> {code}
> Rerunning the above test code with that hack in place will lead to 
> convergence after just 4 iterations instead of hitting the max iterations 
> limit!
> *Impact:*
> I 

[jira] [Commented] (SPARK-23537) Logistic Regression without standardization

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634263#comment-16634263
 ] 

Andrew Crosby commented on SPARK-23537:
---

The different results for standardization=True vs standardization=False are to 
be expected. The reason for this difference is that the two settings lead to 
different effective regularization strengths: with standardization=True, the 
regularization is applied to the scaled model coefficients, whereas with 
standardization=False it is applied to the unscaled model coefficients.

As it's implemented in Spark, the features actually get scaled regardless of 
whether standardization is set to true or false, but when standardization=False 
the strength of the regularization in the scaled space is altered to account 
for this. See the comment at 
[https://github.com/apache/spark/blob/a802c69b130b69a35b372ffe1b01289577f6fafb/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L685.]
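
To make that concrete, here is a rough sketch of the two objectives in my own 
notation (lambda is regParam, sigma_j is the standard deviation of feature j, 
and beta_j is the coefficient of the internally scaled feature; intercept and 
elastic-net mixing are ignored):

{code}
standardization=true:   minimize  loss(beta) + lambda * sum_j beta_j^2
standardization=false:  minimize  loss(beta) + lambda * sum_j (beta_j / sigma_j)^2
                        (i.e. an effective per-feature strength of lambda / sigma_j^2
                         in the scaled space)
{code}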

 

As an aside, your results show a very slow rate of convergence when 
standardization is set to false. I believe this to be an issue caused by the 
continued application of feature scaling when standardization=False which can 
lead to very large gradients from the regularization terms in the solver. I've 
recently raised SPARK-25544 to cover this issue.

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results, I've been expecting to end with two 
> similar models since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me, I end with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grand Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 

[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
feature variances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set to 
false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations (which can occur 
legitimately, e.g. via one-hot encoding) will have very large effective 
regularization strengths and consequently lead to very large gradients and thus 
poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place will lead to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. 

[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
feature variances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set to 
false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large gradients 
and thus poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place will lead to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularizaiton 

[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
feature variances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set to 
false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large gradients 
and thus poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place will lead to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularizaiton strength, and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal 

[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-25544:
-

 Summary: Slow/failed convergence in Spark ML models due to 
internal predictor scaling
 Key: SPARK-25544
 URL: https://issues.apache.org/jira/browse/SPARK-25544
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.2
 Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
Reporter: Andrew Crosby


The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
feature variances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set to 
false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large gradients 
and thus poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
2.0)).toDF("category", "numericFeature", "label")

val indexer = new StringIndexer().setInputCol("category") 
.setOutputCol("categoryIndex")
val encoder = new 
OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
"numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
"interaction")).setOutputCol("features")
val model = new 
LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
assembler, model))

val pipelineModel  = pipeline.fit(df)

val numIterations = 
pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) 
featuresSummarizer.variance.toArray.map(math.sqrt) else 
featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
Rerunning the above test code with that hack in place, will lead to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.
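Until the solver behavior changes, a small user-side guard (a sketch only, 
not part of Spark, using the standard LinearRegressionModel parameter and 
summary accessors) at least makes the silent non-convergence visible instead 
of letting a half-fitted model slip into production:
{code:java}
// Fail fast if L-BFGS stopped only because it ran out of iterations.
val lrModel = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel]
require(
  lrModel.summary.totalIterations < lrModel.getMaxIter,
  s"LinearRegression hit maxIter=${lrModel.getMaxIter} without converging; " +
    "consider standardization=true or rescaling low-variance features")
{code}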

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark

2018-07-12 Thread Andrew Ash (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541265#comment-16541265
 ] 

Andrew Ash commented on SPARK-21962:


Note that HTrace is now being removed from Hadoop – HADOOP-15566 

> Distributed Tracing in Spark
> 
>
> Key: SPARK-21962
> URL: https://issues.apache.org/jira/browse/SPARK-21962
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Major
>
> Spark should support distributed tracing, which is the mechanism, widely 
> popularized by Google in the [Dapper 
> Paper|https://research.google.com/pubs/pub36356.html], where network requests 
> have additional metadata used for tracing requests between services.
> This would be useful for me since I have OpenZipkin style tracing in my 
> distributed application up to the Spark driver, and from the executors out to 
> my other services, but the link is broken in Spark between driver and 
> executor since the Span IDs aren't propagated across that link.
> An initial implementation could instrument the most important network calls 
> with trace ids (like launching and finishing tasks), and incrementally add 
> more tracing to other calls (torrent block distribution, external shuffle 
> service, etc) as the feature matures.
> Search keywords: Dapper, Brave, OpenZipkin, HTrace



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24422) Add JDK9+ in our Jenkins' build servers

2018-07-03 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531241#comment-16531241
 ] 

Andrew Korzhuev commented on SPARK-24422:
-

Also `.travis.yml` needs to be fixed in the following way:
{code:java}
# 2. Choose language and target JDKs for parallel builds.
language: java
jdk:
  - openjdk8
  - openjdk9
  - openjdk10
{code}

> Add JDK9+ in our Jenkins' build servers
> ---
>
> Key: SPARK-24422
> URL: https://issues.apache.org/jira/browse/SPARK-24422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24422) Add JDK9+ in our Jenkins' build servers

2018-07-03 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531241#comment-16531241
 ] 

Andrew Korzhuev edited comment on SPARK-24422 at 7/3/18 11:46 AM:
--

Also `.travis.yml` needs to be fixed in the following way:
{code:java}
# 2. Choose language and target JDKs for parallel builds.
language: java
jdk:
  - openjdk8
  - openjdk9
{code}


was (Author: akorzhuev):
Also `.travis.yml` needs to be fixed in the following way:
{code:java}
# 2. Choose language and target JDKs for parallel builds.
language: java
jdk:
  - openjdk8
  - openjdk9
  - openjdk10
{code}

> Add JDK9+ in our Jenkins' build servers
> ---
>
> Key: SPARK-24422
> URL: https://issues.apache.org/jira/browse/SPARK-24422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24421) sun.misc.Unsafe in JDK9+

2018-07-03 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531235#comment-16531235
 ] 

Andrew Korzhuev edited comment on SPARK-24421 at 7/3/18 11:44 AM:
--

If I understand this correctly, then the only deprecated JDK9+ API Spark is 
using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in 
`[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`,
 which is fixable in the following way:
{code:java}
@@ -22,7 +22,7 @@
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

-import sun.misc.Cleaner;
+import java.lang.ref.Cleaner;
import sun.misc.Unsafe;

public final class Platform {
@@ -169,7 +169,8 @@ public static ByteBuffer allocateDirectBuffer(int size) {
cleanerField.setAccessible(true);
long memory = allocateMemory(size);
ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size);
- Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
+ Cleaner cleaner = Cleaner.create();
+ cleaner.register(buffer, () -> freeMemory(memory));
cleanerField.set(buffer, cleaner);
return buffer;
{code}
[https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]
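For context, here is a minimal standalone sketch of the java.lang.ref.Cleaner 
API that the diff switches to (JDK 9+; written in Scala for illustration, not 
Spark code). Unlike sun.misc.Cleaner.create(obj, thunk), the new API registers 
an action on an existing, usually shared, Cleaner instance and returns a 
Cleanable handle:
{code:java}
import java.lang.ref.Cleaner
import java.nio.ByteBuffer

object CleanerSketch {
  // One shared Cleaner, backed by a single daemon thread.
  private val cleaner = Cleaner.create()

  def allocate(size: Int): ByteBuffer = {
    val buffer = ByteBuffer.allocateDirect(size)
    // The action runs once `buffer` becomes phantom reachable; it must not
    // capture the buffer itself, only what is needed to release the memory.
    cleaner.register(buffer, new Runnable {
      override def run(): Unit = println(s"releasing $size bytes")
    })
    buffer
  }
}
{code}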

 


was (Author: akorzhuev):
If I understand this correctly, then the only deprecated JDK9+ API Spark is 
using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in 
`[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`,
 which is fixable in the following way:

 
{code:java}
@@ -22,7 +22,7 @@
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

-import sun.misc.Cleaner;
+import java.lang.ref.Cleaner;
import sun.misc.Unsafe;

public final class Platform {
@@ -169,7 +169,8 @@ public static ByteBuffer allocateDirectBuffer(int size) {
cleanerField.setAccessible(true);
long memory = allocateMemory(size);
ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size);
- Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
+ Cleaner cleaner = Cleaner.create();
+ cleaner.register(buffer, () -> freeMemory(memory));
cleanerField.set(buffer, cleaner);
return buffer;
{code}
[https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]

 

> sun.misc.Unsafe in JDK9+
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK9+

2018-07-03 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531235#comment-16531235
 ] 

Andrew Korzhuev commented on SPARK-24421:
-

If I understand this correctly, then the only deprecated JDK9+ API Spark is 
using is `sun.misc.Cleaner` (while `sun.misc.Unsafe` is still accessible) in 
`[common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java|https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]`,
 which is fixable in the following way:

 
{code:java}
@@ -22,7 +22,7 @@
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

-import sun.misc.Cleaner;
+import java.lang.ref.Cleaner;
import sun.misc.Unsafe;

public final class Platform {
@@ -169,7 +169,8 @@ public static ByteBuffer allocateDirectBuffer(int size) {
cleanerField.setAccessible(true);
long memory = allocateMemory(size);
ByteBuffer buffer = (ByteBuffer) constructor.newInstance(memory, size);
- Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
+ Cleaner cleaner = Cleaner.create();
+ cleaner.register(buffer, () -> freeMemory(memory));
cleanerField.set(buffer, cleaner);
return buffer;
{code}
[https://github.com/andrusha/spark/commit/7da06d3c725169f9764225f5a29886eb56bee191#diff-c7483c7efce631c783676f014ba2b0ed]

 

> sun.misc.Unsafe in JDK9+
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-07 Thread Andrew Conegliano (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504965#comment-16504965
 ] 

Andrew Conegliano commented on SPARK-24481:
---

Thanks Marco.

Forgot to mention: this error doesn't happen in 2.0.2 or 2.2.0. And in 2.3.0, 
even though the error pops up, the code still runs because Spark disables 
whole-stage codegen for that stage and falls back. The main problem is that in 
a Spark Streaming context the error is logged for every message, so the logs 
fill the disk very quickly.
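If the generated method genuinely cannot be kept under the JVM's 64 KB limit, 
one stop-gap (only a workaround for the log spam, not a fix for the underlying 
bug) is to turn whole-stage codegen off for the affected job via the standard 
spark.sql.codegen.wholeStage switch, so the fallback path is taken without 
attempting and failing the compilation on every micro-batch:
{code:java}
// Skip whole-stage codegen for this session; the plan runs on the non-fused
// code path directly instead of failing compilation and then falling back.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}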

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
>  
> Exception:
> {code:java}
> 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
>   at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
 

Exception:
{code:java}
18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1392)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:167)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:164)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:61)
at 

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
 

Exception:
{code:java}
18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB{code}
 

Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> 

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Environment: Emr 5.13.0 and Databricks Cloud 4.0  (was: Emr 5.13.0)

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300)
a = a + s"when action like '$i%' THEN '$i' "
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> // Databricks notebook source
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Attachment: log4j-active(1).log

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors.  Happens with large case 
> statement:
> {code:java}
> // Databricks notebook source
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://;,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300)
> a = a + s"when action like '$i%' THEN '$i' "
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)
Andrew Conegliano created SPARK-24481:
-

 Summary: GeneratedIteratorForCodegenStage1 grows beyond 64 KB
 Key: SPARK-24481
 URL: https://issues.apache.org/jira/browse/SPARK-24481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: Emr 5.13.0
Reporter: Andrew Conegliano
 Attachments: log4j-active(1).log

Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://;,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300)
a = a + s"when action like '$i%' THEN '$i' "
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-04-17 Thread Andrew Clegg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683
 ] 

Andrew Clegg edited comment on SPARK-22371 at 4/17/18 9:53 AM:
---

Another data point -- I've seen this happen (in 2.3.0) during cleanup after a 
task failure:

{code:none}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: java.lang.IllegalStateException: Attempted 
to access garbage collected accumulator 365
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 31 more
{code}


was (Author: aclegg):
Another data point -- I've seen this happen (in 2.3.0) during cleanup after a 
task failure:

{{Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: java.lang.IllegalStateException: Attempted 
to access garbage collected accumulator 365}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)}}
{{ at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)}}
{{ at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)}}
{{ at scala.Option.foreach(Option.scala:257)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)}}
{{ at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)}}
{{ at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)}}
{{ at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)}}
{{ at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)}}
{{ ... 31 more}}

 

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> from our 

[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-04-17 Thread Andrew Clegg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683
 ] 

Andrew Clegg commented on SPARK-22371:
--

Another data point -- I've seen this happen (in 2.3.0) during cleanup after a 
task failure:

{{Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: java.lang.IllegalStateException: Attempted 
to access garbage collected accumulator 365}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)}}
{{ at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)}}
{{ at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)}}
{{ at scala.Option.foreach(Option.scala:257)}}
{{ at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)}}
{{ at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)}}
{{ at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)}}
{{ at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)}}
{{ at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)}}
{{ at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)}}
{{ ... 31 more}}

 

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> from our investigation it look like accumulator is cleaned by GC first and 
> same accumulator is used for merging the results from executor on task 
> completion event.
> As the error java.lang.IllegalAccessError is LinkageError which is treated as 
> FatalError so dag-scheduler loop is finished with below exception.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:32 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it just just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto commented on SPARK-23890:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it just just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [[instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com/] to get around Spark 2's restriction...halp!  Don't 
make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [[instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala] to get around Spark 2's restriction...halp!  Don't 
make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

External issue URL: https://github.com/apache/spark/pull/21012

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-06 Thread Andrew Otto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

Description: 
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
from JSON events by adding newly found columns to the Hive table schema, via a 
Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing, it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.
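
To make the recursive merge described above concrete, here is a minimal sketch (not the actual Refine code) of diffing an input DataFrame schema against a Hive table schema with Spark's StructType API; object and field names are illustrative:

{code:java}
// Hedged sketch: find fields present in the incoming schema but missing from
// the Hive table schema, descending into struct fields that exist on both sides.
import org.apache.spark.sql.types.StructType

object SchemaDiff {
  def newFields(existing: StructType, incoming: StructType, prefix: String = ""): Seq[String] = {
    incoming.fields.toSeq.flatMap { in =>
      existing.fields.find(_.name.equalsIgnoreCase(in.name)) match {
        case None =>
          Seq(prefix + in.name)                       // brand new (possibly nested) field
        case Some(ex) =>
          (ex.dataType, in.dataType) match {
            case (e: StructType, i: StructType) =>
              newFields(e, i, prefix + in.name + ".")  // recurse into matching structs
            case _ =>
              Seq.empty                                // field already exists, nothing to add
          }
      }
    }
  }
}
{code}

Each returned path (for example user_info.new_field, a hypothetical name) is the kind of nested addition that currently forces the CHANGE COLUMN statement discussed in this issue.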

 

 

  was:
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
come from JSON data by adding newly found columns to the Hive table schema, via 
a Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing, it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.

 

 


> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it just just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> 

[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-06 Thread Andrew Otto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

Description: 
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This restriction was loosened in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
come from JSON data by adding newly found columns to the Hive table schema, via 
a Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing, it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.

 

 

  was:
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This was expanded in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
come from JSON data by adding newly found columns to the Hive table schema, via 
a Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing, it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.

 

 


> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> come from JSON data by adding newly found columns to the Hive table schema, 
> via a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be 

[jira] [Created] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-06 Thread Andrew Otto (JIRA)
Andrew Otto created SPARK-23890:
---

 Summary: Hive ALTER TABLE CHANGE COLUMN for struct type no longer 
works
 Key: SPARK-23890
 URL: https://issues.apache.org/jira/browse/SPARK-23890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Otto


As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
CHANGE COLUMN commands to Hive.  This was expanded in 
[https://github.com/apache/spark/pull/12714] to allow for those commands if 
they only change the column comment.

Wikimedia has been evolving Parquet backed Hive tables with data originally 
come from JSON data by adding newly found columns to the Hive table schema, via 
a Spark job we call 'Refine'.  We do this by recursively merging an input 
DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
then issuing an ALTER TABLE statement to add the columns.  However, because we 
allow for nested data types in the incoming JSON data, we make extensive use of 
struct type fields.  In order to add newly detected fields in a nested data 
type, we must alter the struct column and append the nested struct field.  This 
requires CHANGE COLUMN that alters the column type.  In reality, the 'type' of 
the column is not changing, it is just a new field being added to the struct, 
but to SQL, this looks like a type change.

We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
can be sent to Hive will block us.  I believe this is fixable by adding an 
exception in 
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
 to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
destination type are both struct types, and the destination type only adds new 
fields.
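
As an illustration of the point above, appending one nested field means restating the whole struct type, which Spark 2.x then rejects as a type change; the table and column names in this sketch are hypothetical, not taken from the actual tables:

{code:java}
// Hedged sketch: the existing column is  meta STRUCT<domain:STRING, uri:STRING>
// and the incoming data adds meta.request_id. The only way to express that in
// Hive DDL is to restate the full struct type, so Spark 2.x treats it as a
// type change and refuses the statement.
import org.apache.spark.sql.SparkSession

object NestedFieldLooksLikeTypeChange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sql(
      """ALTER TABLE event.navigationtiming
        |CHANGE COLUMN meta meta
        |STRUCT<domain:STRING, uri:STRING, request_id:STRING>""".stripMargin)  // rejected by Spark 2.x with an AnalysisException
  }
}
{code}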

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428517#comment-16428517
 ] 

Andrew Davidson commented on SPARK-23878:
-

yes please mark resolved

 

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a description of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428509#comment-16428509
 ] 

Andrew Davidson commented on SPARK-23878:
-

added screen shot to  https://www.pinterest.com/pin/842454674019358931/

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a description of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428494#comment-16428494
 ] 

Andrew Davidson commented on SPARK-23878:
-

Hi Hyukjin

Many thanks! I am not a python expert. I did not know about dynamic namespaces. 
Did a little googling.

To configure the dynamic namespace in eclipse:

preferences -> pydev -> interpreter -> python interpreters 

add pyspark to 'forced builtins'

 

 

I attached a screen shot. Any idea how this can be added to the documentation 
so others do not have to waste time on this detail?

p.s. you sent me a link to "support virtualenv in PySpark" 
https://issues.apache.org/jira/browse/SPARK-13587 

Once I made the pydev configuration change I am able to use pyspark in a 
virtualenv. I use this environment in the eclipse pydev IDE without any problems. 
I am able to run Jupyter notebooks without any problem.

 

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a description of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-23878:

Attachment: eclipsePyDevPySparkConfig.png

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a description of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-05 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427883#comment-16427883
 ] 

Andrew Davidson commented on SPARK-23878:
-

Hi Hyukjin

You are correct. Most IDEs are primarily language-aware editors and builders. 
For example, consider eclipse or IntelliJ for developing a javascript website or 
a java servlet. The editor functionality knows about the syntax of the language 
you are working with, along with the libraries and packages you are using. Often 
the IDE does some sort of continuous build or code analysis to help you find 
bugs without having to deploy.

Often the IDE makes it easy to build, package, and actually deploy on some sort 
of test server, and to debug and/or run unit tests.

So if pyspark is generating functions at runtime, that is going to cause problems 
for the IDE: the functions are not defined in the edit session.

[http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext] 
describes how to write unit tests for pyspark that you can run from your 
command line and/or from within eclipse.  I think a side effect is that they 
might cause the functions lit() and col() to be generated?

 

I could not find a work around for col() and lit().

 

    ret = df.select(
        col(columnName).cast("string").alias("key"),
        lit(value).alias("source")
    )

 

Kind regards

 

Andy  

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a description of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23878) unable to import col() or lit()

2018-04-05 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-23878:
---

 Summary: unable to import col() or lit()
 Key: SPARK-23878
 URL: https://issues.apache.org/jira/browse/SPARK-23878
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: eclipse 4.7.3

pyDev 6.3.2

pyspark==2.3.0
Reporter: Andrew Davidson


I have some code I am moving from a jupyter notebook to separate python 
modules. My notebook uses col() and lit() and works fine.

when I try to work with module files in my IDE I get the following errors. I am 
also not able to run my unit tests.

{color:#FF}Description Resource Path Location Type{color}
{color:#FF}Unresolved import: lit load.py 
/adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}

{color:#FF}Description Resource Path Location Type{color}
{color:#FF}Unresolved import: col load.py 
/adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}

I suspect that when you run pyspark it is generating the col and lit functions?

I found a description of the problem @ 
[https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
 I do not understand how to make this work in my IDE. I am not running pyspark, 
just an editor.

is there some sort of workaround or replacement for these missing functions?

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 3:00 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
_,_ which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the 
number of objects and the volume grow indefinitely.

Before worker dies:

!screen_shot_2018-03-20_at_15.23.29.png!

Heap dump of worker running for some time:

!Screen Shot 2018-03-28 at 16.44.20.png!
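
For anyone trying to reproduce this, a small hedged sketch of watching state store growth from the driver via the streaming progress API; the rate source stands in for the real S3 input and the checkpoint path is hypothetical:

{code:java}
// Hedged sketch: poll StreamingQueryProgress.stateOperators to watch the memory
// held by HDFSBackedStateStoreProvider. With a bounded key space the
// memoryUsedBytes figure should plateau; steady growth matches the leak above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StateStoreMemoryProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    val query = spark.readStream
      .format("rate").load()                          // toy source standing in for the S3 input
      .dropDuplicates("value")                        // stateful dedup backed by the state store
      .writeStream
      .format("console")
      .option("checkpointLocation", "s3a://test-bucket/checkpoint")  // hypothetical path
      .trigger(Trigger.ProcessingTime("60 seconds"))
      .start()

    while (query.isActive) {
      Option(query.lastProgress).foreach { p =>
        p.stateOperators.foreach { op =>
          println(s"state rows=${op.numRowsTotal} memoryUsedBytes=${op.memoryUsedBytes}")
        }
      }
      Thread.sleep(60000)
    }
  }
}
{code}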


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
_,_ which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both 
number of objects and volume grows indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> 

[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23682:

Attachment: screen_shot_2018-03-20_at_15.23.29.png

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creating looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump I found that most of the memory used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}}
>  that is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196]
>  
> On the first glance it looks normal since that is how Spark keeps aggregation 
> keys in memory. However I did my testing by renaming files in source folder, 
> so that they could be picked up by spark again. Since input records are the 
> same all further rows should be rejected as duplicates and memory consumption 
> shouldn't increase but it's not true. Moreover, GC time took more than 30% of 
> total processing time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:56 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
_,_ which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the 
number of objects and the volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
_,_ which appears not to clean up _UnsafeRow_s coming from: 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both 
number of objects and volume grows indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> 

[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:55 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
_,_ which appears not to clean up _UnsafeRow_s coming from: 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the 
number of objects and the volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_

 

_private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_

 

_,_ which appears not to clean up _UnsafeRow_s coming from:

 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both 
number of objects and volume grows indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> 

[jira] [Commented] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev commented on SPARK-23682:
-

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]_:_

 

_private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_

 

_,_ which appears not to clean up _UnsafeRow_s coming from:

 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the 
number of objects and the volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> 

[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23682:

Attachment: Screen Shot 2018-03-28 at 16.44.20.png

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump, I found that most of the memory is used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}},
>  which is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196].
> At first glance this looks normal, since that is how Spark keeps aggregation 
> keys in memory. However, I did my testing by renaming files in the source 
> folder so that they would be picked up by Spark again. Since the input records 
> are the same, all further rows should be rejected as duplicates and memory 
> consumption shouldn't increase, but that is not the case. Moreover, GC time 
> took more than 30% of total processing time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23682:

Attachment: Screen Shot 2018-03-28 at 16.44.20.png

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Spark 
> executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump, I found that most of the memory is used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}},
>  which is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196].
> At first glance this looks normal, since that is how Spark keeps aggregation 
> keys in memory. However, I did my testing by renaming files in the source 
> folder so that they would be picked up by Spark again. Since the input records 
> are the same, all further rows should be rejected as duplicates and memory 
> consumption shouldn't increase, but that is not the case. Moreover, GC time 
> took more than 30% of total processing time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images

2018-03-22 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409609#comment-16409609
 ] 

Andrew Korzhuev commented on SPARK-22865:
-

Here is an example of how this can be done with Travis CI and prebuilt Spark 
binaries: [https://github.com/andrusha/spark-k8s-docker]

Published here: [https://hub.docker.com/r/andrusha/spark-k8s/]

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


