[jira] [Resolved] (SPARK-34701) Remove analyzing temp view again in CreateViewCommand

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34701.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31933
[https://github.com/apache/spark/pull/31933]

> Remove analyzing temp view again in CreateViewCommand
> -
>
> Key: SPARK-34701
> URL: https://issues.apache.org/jira/browse/SPARK-34701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.2.0
>
>
> Remove analyzing the temp view again in CreateViewCommand. This can be done once 
> all callers pass an analyzed plan to CreateViewCommand.
> Reference:
> https://github.com/apache/spark/pull/31652/files#r58959
> https://github.com/apache/spark/pull/31273/files#r581592786



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34701) Remove analyzing temp view again in CreateViewCommand

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34701:
---

Assignee: Terry Kim

> Remove analyzing temp view again in CreateViewCommand
> -
>
> Key: SPARK-34701
> URL: https://issues.apache.org/jira/browse/SPARK-34701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Remove analyzing the temp view again in CreateViewCommand. This can be done once 
> all callers pass an analyzed plan to CreateViewCommand.
> Reference:
> https://github.com/apache/spark/pull/31652/files#r58959
> https://github.com/apache/spark/pull/31273/files#r581592786



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-03-25 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308435#comment-17308435
 ] 

Yuming Wang commented on SPARK-34859:
-

[~lxian2] Thank you for reporting this issue. Would you like to work on this?

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, and some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` does no such synchronization 
> among pages from different columns, so using `readNextFilteredRowGroup` may 
> produce incorrect results.
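
A simplified illustration of the misalignment (plain Scala collections, not Parquet API): each column's filtered pages keep values for a different set of row indexes, so values have to be joined on the shared row index before rows are assembled.
{code:scala}
// rowIndex -> value surviving each column's page-level filter (hypothetical data)
val colA = Map(0L -> "a0", 2L -> "a2", 3L -> "a3")
val colB = Map(0L -> "b0", 1L -> "b1", 3L -> "b3")

// Align on the shared row indexes; naive positional zipping would pair a2 with b1.
val alignedRows = (colA.keySet & colB.keySet).toSeq.sorted.map(i => (i, colA(i), colB(i)))
// alignedRows == Seq((0L, "a0", "b0"), (3L, "a3", "b3"))
{code}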



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34850) Multiply day-time interval by numeric

2021-03-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-34850.
--
Resolution: Fixed

Issue resolved by pull request 31951
[https://github.com/apache/spark/pull/31951]

> Multiply day-time interval by numeric
> -
>
> Key: SPARK-34850
> URL: https://issues.apache.org/jira/browse/SPARK-34850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the multiply op over day-time interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType
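
A hedged example of the new behavior, assuming the ANSI day-time interval literal syntax that ships with the 3.2 interval work:
{code:scala}
// 1 day 10:30:00 multiplied by 2 should yield 2 days 21:00:00.
spark.sql("SELECT INTERVAL '1 10:30:00' DAY TO SECOND * 2").show(truncate = false)
{code}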



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34345) Allow several properties files

2021-03-25 Thread Arseniy Tashoyan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307644#comment-17307644
 ] 

Arseniy Tashoyan edited comment on SPARK-34345 at 3/25/21, 7:52 AM:


Typesafe Config seems attractive, as it provides the following features:
 * _include_ directives
 * environment variable injections

This is convenient when running an application in Kubernetes and providing the 
configuration in ConfigMaps.

As an alternative, a Scala wrapper like 
[Pureconfig|https://pureconfig.github.io/] could be used.

Implementation details:
 * Deliver reference files _reference.conf_ within the Spark libraries, containing 
default (reference) values for all Spark settings.
 ** Different Spark libraries may have their own _reference.conf_ files. 
For example, _spark-sql/reference.conf_ contains settings specific to Spark SQL, 
and so on.
 * Use the Config API to get values.
 ** Clean up default values currently hard-coded in Scala or Java.
 * Introduce a new command-line argument for spark-submit: _--config-file_
 ** When both _--config-file_ and _--properties-file_ are specified, ignore the 
latter and print a warning.
 ** When only _--properties-file_ is specified, use the legacy way and print a 
deprecation warning.


was (Author: tashoyan):
Typesafe Config seems attractive, as it provides the following features:
 * _include_ directives
 * environment variable injections

This is convenient when running an application in Kubernetes and providing the 
configuration in ConfigMaps.

As an alternative, a Scala wrapper like 
[Pureconfig|https://pureconfig.github.io] could be used.

Implementation details:
 * Deliver within Spark libraries reference files _reference.conf_ containing 
default (reference) values for all Spark settings.
 ** Different Spark libraries may have their respective files _reference.conf_. 
For example, _spark-sql/reference.conf_ contains settings specific to Spark SQL 
and so on.
 * Use Config API to get values
 ** Cleanup default values, coded in Scala or Java
 * Introduce new command-line argument for spark-submit: _--config-file_
 ** When both _--config-file_ and _--properties-file_ specified, ignore the 
latter and print a warning
 ** When only _--properties-file_ specified, use the legacy way and print a 
deprecation warning.

> Allow several properties files
> --
>
> Key: SPARK-34345
> URL: https://issues.apache.org/jira/browse/SPARK-34345
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> Example: we have 2 applications A and B. These applications have some common 
> Spark settings and some application-specific settings. The idea is to run 
> them like this:
> {code:bash}
> spark-submit --properties-files common.properties,a.properties A
> spark-submit --properties-files common.properties,b.properties B
> {code}
> Benefits:
>  - Common settings can be extracted to a common file _common.properties_, with no 
> need to duplicate them in _a.properties_ and _b.properties_
>  - Applications can override common settings in their respective custom 
> properties files
> Currently the following mechanism works in SparkSubmitArguments.scala: 
> console arguments like _--conf key=value_ overwrite settings in the 
> properties file. This is not enough, because console arguments should be 
> specified in the launcher script; de-facto they belong to the binary 
> distribution rather than the configuration.
> Consider the following scenario: Spark on Kubernetes, the configuration is 
> provided as a ConfigMap. We could have the following ConfigMaps:
>  - _a.properties_ // mount to the Pod with application A
>  - _b.properties_ // mount to the Pod with application B
>  - _common.properties_ // mount to both Pods with A and B
>  Meanwhile, the launcher script _app-submit.sh_ is the same for both 
> applications A and B, since it contains no configuration settings:
> {code:bash}
> spark-submit --properties-files common.properties,${app_name}.properties ...
> {code}
> *Alternate solution*
> Use Typesafe Config for Spark settings instead of properties files. Typesafe 
> Config allows including files.
>  For example, settings for the application A - _a.conf_:
> {code:yaml}
> include required("common.conf")
> spark.sql.shuffle.partitions = 240
> {code}
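
A minimal sketch of the alternate solution, assuming the Typesafe Config library (com.typesafe:config) is on the classpath and _a.conf_ includes _common.conf_ as shown above:
{code:scala}
import com.typesafe.config.ConfigFactory
import java.io.File

// parseFile resolves the include of common.conf relative to a.conf,
// then resolve() substitutes any references or environment variables.
val conf = ConfigFactory.parseFile(new File("a.conf")).resolve()
val shufflePartitions = conf.getInt("spark.sql.shuffle.partitions")  // 240
{code}
Each resolved key/value pair could then be handed to Spark as an ordinary configuration setting.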



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34857.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31956
[https://github.com/apache/spark/pull/31956]

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Minor
> Fix For: 3.2.0
>
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?
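
A hedged repro sketch: `df.na.drop(minNonNulls)` is one public API that plants an AtLeastNNonNulls predicate, so the truncated rendering shows up directly in the explain output.
{code:scala}
val df = spark.range(5).selectExpr("id AS c1", "CAST(NULL AS STRING) AS c2")
// The Filter condition currently prints as AtLeastNNulls(n, ...) instead of
// AtLeastNNonNulls(1, c1, c2).
df.na.drop(1).explain()
{code}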



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34857:


Assignee: Tim Armstrong

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34861) Support nested column in Spark vectorized readers

2021-03-25 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34861:
-
Summary: Support nested column in Spark vectorized readers  (was: Support 
nested column in spark vectorized reader)

> Support nested column in Spark vectorized readers
> -
>
> Key: SPARK-34861
> URL: https://issues.apache.org/jira/browse/SPARK-34861
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is the umbrella task to track the overall progress. The task is to 
> support nested column type in Spark vectorized reader, namely Parquet and 
> ORC. Currently both Parquet and ORC vectorized readers do not support nested 
> column type (struct, array and map). We implemented nested column vectorized 
> reader for FB-ORC in our internal fork of Spark. We are seeing performance 
> improvement compared to non-vectorized reader when reading nested columns. In 
> addition, this can also help improve the non-nested column performance when 
> reading non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]
>  
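
A hedged illustration of the target workload (hypothetical output path): a Parquet file with struct, array and map columns, which today falls back to the non-vectorized, row-by-row read path.
{code:scala}
spark.range(10)
  .selectExpr(
    "id",
    "named_struct('a', id, 'b', CAST(id AS STRING)) AS s",  // struct
    "array(id, id + 1) AS arr",                             // array
    "map('k', id) AS m")                                    // map
  .write.mode("overwrite").parquet("/tmp/nested_columns_demo")

spark.read.parquet("/tmp/nested_columns_demo").select("s.a", "arr", "m").explain()
{code}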



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34861) Support nested column in spark vectorized reader

2021-03-25 Thread Cheng Su (Jira)
Cheng Su created SPARK-34861:


 Summary: Support nested column in spark vectorized reader
 Key: SPARK-34861
 URL: https://issues.apache.org/jira/browse/SPARK-34861
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


This is the umbrella task to track the overall progress. The task is to support 
nested column type in Spark vectorized reader, namely Parquet and ORC. 
Currently both Parquet and ORC vectorized readers do not support nested column 
type (struct, array and map). We implemented nested column vectorized reader 
for FB-ORC in our internal fork of Spark. We are seeing performance improvement 
compared to non-vectorized reader when reading nested columns. In addition, 
this can also help improve the non-nested column performance when reading 
non-nested and nested columns together in one query.

 

Parquet: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]

 

ORC:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34861) Support nested column in Spark vectorized readers

2021-03-25 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308501#comment-17308501
 ] 

Cheng Su commented on SPARK-34861:
--

Just FYI, we are working on each individual task now. The target is 
Spark 3.2. Thanks.

> Support nested column in Spark vectorized readers
> -
>
> Key: SPARK-34861
> URL: https://issues.apache.org/jira/browse/SPARK-34861
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is the umbrella task to track the overall progress. The task is to 
> support nested column type in Spark vectorized reader, namely Parquet and 
> ORC. Currently both Parquet and ORC vectorized readers do not support nested 
> column type (struct, array and map). We implemented nested column vectorized 
> reader for FB-ORC in our internal fork of Spark. We are seeing performance 
> improvement compared to non-vectorized reader when reading nested columns. In 
> addition, this can also help improve the non-nested column performance when 
> reading non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Cheng Su (Jira)
Cheng Su created SPARK-34862:


 Summary: Support nested column in Spark ORC vectorized readers
 Key: SPARK-34862
 URL: https://issues.apache.org/jira/browse/SPARK-34862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


The task is to support nested column type in Spark ORC vectorized reader. 
Currently both ORC vectorized reader does not support nested column type 
(struct, array and map). We implemented nested column vectorized reader for 
FB-ORC in our internal fork of Spark. We are seeing performance improvement 
compared to non-vectorized reader when reading nested columns. In addition, 
this can also help improve the non-nested column performance when reading 
non-nested and nested columns together in one query.

 

ORC:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34862:
-
Description: 
The task is to support nested column type in Spark ORC vectorized reader. 
Currently ORC vectorized reader does not support nested column type (struct, 
array and map). We implemented nested column vectorized reader for FB-ORC in 
our internal fork of Spark. We are seeing performance improvement compared to 
non-vectorized reader when reading nested columns. In addition, this can also 
help improve the non-nested column performance when reading non-nested and 
nested columns together in one query.

 

ORC:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]

  was:
The task is to support nested column type in Spark ORC vectorized reader. 
Currently both ORC vectorized reader does not support nested column type 
(struct, array and map). We implemented nested column vectorized reader for 
FB-ORC in our internal fork of Spark. We are seeing performance improvement 
compared to non-vectorized reader when reading nested columns. In addition, 
this can also help improve the non-nested column performance when reading 
non-nested and nested columns together in one query.

 

ORC:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]


> Support nested column in Spark ORC vectorized readers
> -
>
> Key: SPARK-34862
> URL: https://issues.apache.org/jira/browse/SPARK-34862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> The task is to support nested column type in Spark ORC vectorized reader. 
> Currently ORC vectorized reader does not support nested column type (struct, 
> array and map). We implemented nested column vectorized reader for FB-ORC in 
> our internal fork of Spark. We are seeing performance improvement compared to 
> non-vectorized reader when reading nested columns. In addition, this can also 
> help improve the non-nested column performance when reading non-nested and 
> nested columns together in one query.
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308504#comment-17308504
 ] 

Cheng Su commented on SPARK-34862:
--

Just FYI I am working on it now.

> Support nested column in Spark ORC vectorized readers
> -
>
> Key: SPARK-34862
> URL: https://issues.apache.org/jira/browse/SPARK-34862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> The task is to support nested column type in Spark ORC vectorized reader. 
> Currently ORC vectorized reader does not support nested column type (struct, 
> array and map). We implemented nested column vectorized reader for FB-ORC in 
> our internal fork of Spark. We are seeing performance improvement compared to 
> non-vectorized reader when reading nested columns. In addition, this can also 
> help improve the non-nested column performance when reading non-nested and 
> nested columns together in one query.
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34863) Support nested column in Spark Parquet vectorized readers

2021-03-25 Thread Cheng Su (Jira)
Cheng Su created SPARK-34863:


 Summary: Support nested column in Spark Parquet vectorized readers
 Key: SPARK-34863
 URL: https://issues.apache.org/jira/browse/SPARK-34863
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


The task is to support nested column type in Spark Parquet vectorized reader. 
Currently Parquet vectorized reader does not support nested column type 
(struct, array and map). We implemented nested column vectorized reader for 
FB-ORC in our internal fork of Spark. We are seeing performance improvement 
compared to non-vectorized reader when reading nested columns. In addition, 
this can also help improve the non-nested column performance when reading 
non-nested and nested columns together in one query.

 

Parquet: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34863) Support nested column in Spark Parquet vectorized readers

2021-03-25 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308505#comment-17308505
 ] 

Cheng Su commented on SPARK-34863:
--

Just FYI I am working on it now.

> Support nested column in Spark Parquet vectorized readers
> -
>
> Key: SPARK-34863
> URL: https://issues.apache.org/jira/browse/SPARK-34863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> The task is to support nested column type in Spark Parquet vectorized reader. 
> Currently Parquet vectorized reader does not support nested column type 
> (struct, array and map). We implemented nested column vectorized reader for 
> FB-ORC in our internal fork of Spark. We are seeing performance improvement 
> compared to non-vectorized reader when reading nested columns. In addition, 
> this can also help improve the non-nested column performance when reading 
> non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34862:


Assignee: (was: Apache Spark)

> Support nested column in Spark ORC vectorized readers
> -
>
> Key: SPARK-34862
> URL: https://issues.apache.org/jira/browse/SPARK-34862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> The task is to support nested column type in Spark ORC vectorized reader. 
> Currently ORC vectorized reader does not support nested column type (struct, 
> array and map). We implemented nested column vectorized reader for FB-ORC in 
> our internal fork of Spark. We are seeing performance improvement compared to 
> non-vectorized reader when reading nested columns. In addition, this can also 
> help improve the non-nested column performance when reading non-nested and 
> nested columns together in one query.
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34862:


Assignee: Apache Spark

> Support nested column in Spark ORC vectorized readers
> -
>
> Key: SPARK-34862
> URL: https://issues.apache.org/jira/browse/SPARK-34862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> The task is to support nested column type in Spark ORC vectorized reader. 
> Currently ORC vectorized reader does not support nested column type (struct, 
> array and map). We implemented nested column vectorized reader for FB-ORC in 
> our internal fork of Spark. We are seeing performance improvement compared to 
> non-vectorized reader when reading nested columns. In addition, this can also 
> help improve the non-nested column performance when reading non-nested and 
> nested columns together in one query.
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30497) migrate DESCRIBE TABLE to the new framework

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30497.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

> migrate DESCRIBE TABLE to the new framework
> ---
>
> Key: SPARK-30497
> URL: https://issues.apache.org/jira/browse/SPARK-30497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34862) Support nested column in Spark ORC vectorized readers

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308524#comment-17308524
 ] 

Apache Spark commented on SPARK-34862:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/31958

> Support nested column in Spark ORC vectorized readers
> -
>
> Key: SPARK-34862
> URL: https://issues.apache.org/jira/browse/SPARK-34862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> The task is to support nested column type in Spark ORC vectorized reader. 
> Currently ORC vectorized reader does not support nested column type (struct, 
> array and map). We implemented nested column vectorized reader for FB-ORC in 
> our internal fork of Spark. We are seeing performance improvement compared to 
> non-vectorized reader when reading nested columns. In addition, this can also 
> help improve the non-nested column performance when reading non-nested and 
> nested columns together in one query.
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34864) cannot resolve methods when & col in spark java

2021-03-25 Thread unical1988 (Jira)
unical1988 created SPARK-34864:
--

 Summary: cannot resolve methods when & col in spark java
 Key: SPARK-34864
 URL: https://issues.apache.org/jira/browse/SPARK-34864
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.1.1
Reporter: unical1988


Hello,

 

I am using Spark 3.1.1 with Java and the `when` method of Column, and also `col`, but 
when I compile it says "cannot resolve method when" and "cannot resolve method col".

 

What is the matter?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34865) cannot resolve methods when & col in spark java

2021-03-25 Thread unical1988 (Jira)
unical1988 created SPARK-34865:
--

 Summary: cannot resolve methods when & col in spark java
 Key: SPARK-34865
 URL: https://issues.apache.org/jira/browse/SPARK-34865
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.1.1
Reporter: unical1988


Hello,

 

I am using Spark 3.1.1 with Java and the `when` method of Column, and also `col`, but 
when I compile it says "cannot resolve method when" and "cannot resolve method col".

 

What is the matter?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34866) cannot resolve method when of Column() spark java

2021-03-25 Thread unical1988 (Jira)
unical1988 created SPARK-34866:
--

 Summary: cannot resolve method when of Column() spark java
 Key: SPARK-34866
 URL: https://issues.apache.org/jira/browse/SPARK-34866
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.1.1
Reporter: unical1988


Cannot resolve the `when` method of Column in Spark (Java API), and also the `col` 
method. What's the problem?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34867) cannot resolve method when of Column() spark java

2021-03-25 Thread unical1988 (Jira)
unical1988 created SPARK-34867:
--

 Summary: cannot resolve method when of Column() spark java
 Key: SPARK-34867
 URL: https://issues.apache.org/jira/browse/SPARK-34867
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.1.1
Reporter: unical1988


The error says cannot resolve method main (of Column()) and col in Spark 3.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34867) cannot resolve method when of Column() spark java

2021-03-25 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308644#comment-17308644
 ] 

Yuming Wang commented on SPARK-34867:
-

How to reproduce this issue?

> cannot resolve method when of Column() spark java
> -
>
> Key: SPARK-34867
> URL: https://issues.apache.org/jira/browse/SPARK-34867
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> The error says cannot resolve method main (of Column()) and col in Spark 3.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories

2021-03-25 Thread Surendar Thina G (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308648#comment-17308648
 ] 

Surendar Thina G commented on SPARK-26663:
--

A similar issue, specific to ORC files, is documented in SPARK-28098.
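
A hedged workaround sketch, not a confirmed fix: since the linked report points at the native ORC reader, falling back to the Hive SerDe path for ORC tables may be worth trying before re-running the failing query.
{code:scala}
// Assumption: the table is stored as ORC and the regression comes from the native reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.table("c").count()
{code}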

> Cannot query a Hive table with subdirectories
> -
>
> Key: SPARK-26663
> URL: https://issues.apache.org/jira/browse/SPARK-26663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Aäron
>Priority: Major
>
> Hello,
>  
> I want to report the following issue (my first one :) )
> When I create a table in Hive based on a UNION ALL, then Spark 2.4 is unable 
> to query this table.
> To reproduce:
> *Hive 1.2.1*
> {code:java}
> hive> create table a(id int);
> insert into a values(1);
> hive> create table b(id int);
> insert into b values(2);
> hive> create table c as select id from a union all select id from b;
> {code}
>  
> *Spark 2.3.1*
>  
> {code:java}
> scala> spark.table("c").show
> +---+
> | id|
> +---+
> | 1|
> | 2|
> +---+
> scala> spark.table("c").count
> res5: Long = 2
>  {code}
>  
> *Spark 2.4.0*
> {code:java}
> scala> spark.table("c").show
> 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table 
> perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using 
> metastore schema.
> +---+
> | id|
> +---+
> +---+
> scala> spark.table("c").count
> res3: Long = 0
> {code}
> I did not find an existing issue for this.  Might be important to investigate.
>  
> +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf.
>  
> Kind regards.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34867) cannot resolve method when of Column() spark java

2021-03-25 Thread unical1988 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308653#comment-17308653
 ] 

unical1988 commented on SPARK-34867:


{code:java}
firstDF.select(firstDeltaDF.apply("col4").when(firstDeltaDF.apply("col4").equalTo("II259"),
 "II999")
 .when(col("gender").equalTo("female"), 1)
  .otherwise(""));{code}
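
A hedged sketch of the usual cause, written in Scala: `when` and `col` are members of `org.apache.spark.sql.functions`, and `Column.when` can only be chained onto a Column that was itself produced by `functions.when(...)`; the Java API mirrors this via `import static org.apache.spark.sql.functions.*;`.
{code:scala}
import org.apache.spark.sql.functions.{col, when}

// Start the chain with functions.when(...) rather than an arbitrary column.
firstDF.select(
  when(col("col4") === "II259", "II999")
    .when(col("gender") === "female", 1)
    .otherwise(""))
{code}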

> cannot resolve method when of Column() spark java
> -
>
> Key: SPARK-34867
> URL: https://issues.apache.org/jira/browse/SPARK-34867
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> The error says cannot resolve method main (of Column()) and col in Spark 3.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34867) cannot resolve method when of Column() spark java

2021-03-25 Thread unical1988 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308653#comment-17308653
 ] 

unical1988 edited comment on SPARK-34867 at 3/25/21, 12:30 PM:
---

{code:java}
// I tried to specify a column before using when on it (with 
firstDF.apply("col4"))
firstDF.select(firstDF.apply("col4").when(firstDeltaDF.apply("col4").equalTo("II259"),
 "II999")
 .when(col("gender").equalTo("female"), 1)
  .otherwise(""));

{code}


was (Author: unical1988):
{code:java}
firstDF.select(firstDeltaDF.apply("col4").when(firstDeltaDF.apply("col4").equalTo("II259"),
 "II999")
 .when(col("gender").equalTo("female"), 1)
  .otherwise(""));{code}

> cannot resolve method when of Column() spark java
> -
>
> Key: SPARK-34867
> URL: https://issues.apache.org/jira/browse/SPARK-34867
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> The error says cannot resolve method main (of Column()) and col in Spark 3.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-34868:
-
Description: 
Support the divide op of year-month interval by numeric types including:
# ByteType
# ShortType
# IntegerType
# LongType
# FloatType
# DoubleType
# DecimalType

  was:
Support the multiply op over year-month interval by numeric types including:
# ByteType
# ShortType
# IntegerType
# LongType
# FloatType
# DoubleType
# DecimalType


> Divide year-month interval by numeric
> -
>
> Key: SPARK-34868
> URL: https://issues.apache.org/jira/browse/SPARK-34868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the divide op of year-month interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType
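
A hedged example of the intended behavior, assuming the ANSI year-month interval literal syntax from the 3.2 interval work:
{code:scala}
// 2 years 6 months (30 months) divided by 2 should yield 1 year 3 months.
spark.sql("SELECT INTERVAL '2-6' YEAR TO MONTH / 2").show(truncate = false)
{code}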



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Max Gekk (Jira)
Max Gekk created SPARK-34868:


 Summary: Divide year-month interval by numeric
 Key: SPARK-34868
 URL: https://issues.apache.org/jira/browse/SPARK-34868
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.2.0


Support the multiply op over year-month interval by numeric types including:
# ByteType
# ShortType
# IntegerType
# LongType
# FloatType
# DoubleType
# DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34869:
--

 Summary: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with 
pod describe output
 Key: SPARK-34869
 URL: https://issues.apache.org/jira/browse/SPARK-34869
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 

As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend 


> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe 
output.


  was:

As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend 



> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros reassigned SPARK-34869:
--

Assignee: Attila Zsolt Piros

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308679#comment-17308679
 ] 

Attila Zsolt Piros commented on SPARK-34869:


I am working on this.

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34833) Apply right-padding correctly for correlated subqueries

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308683#comment-17308683
 ] 

Apache Spark commented on SPARK-34833:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31959

> Apply right-padding correctly for correlated subqueries
> ---
>
> Key: SPARK-34833
> URL: https://issues.apache.org/jira/browse/SPARK-34833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.1.2
>
>
> This ticket aims at fixing the bug where right-padding is not applied for char 
> types inside correlated subqueries.
> For example, the query below returns nothing in master, but the correct result 
> is `c`.
> {code}
> scala> sql(s"CREATE TABLE t1(v VARCHAR(3), c CHAR(5)) USING parquet")
> scala> sql(s"CREATE TABLE t2(v VARCHAR(5), c CHAR(7)) USING parquet")
> scala> sql("INSERT INTO t1 VALUES ('c', 'b')")
> scala> sql("INSERT INTO t2 VALUES ('a', 'b')")
> scala> val df = sql("""
>   |SELECT v FROM t1
>   |WHERE 'a' IN (SELECT v FROM t2 WHERE t2.c = t1.c )""".stripMargin)
> scala> df.show()
> +---+
> |  v|
> +---+
> +---+
> {code}
> This is because `ApplyCharTypePadding`  does not handle the case above to 
> apply right-padding into `'abc'`. 
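
A small illustration of why the padding matters here, using plain Scala strings: t1.c is CHAR(5) and t2.c is CHAR(7), so the stored values are padded to different widths and only compare equal once the shorter side is right-padded as well.
{code:scala}
val t1c = "b".padTo(5, ' ')  // "b    "   (CHAR(5) storage)
val t2c = "b".padTo(7, ' ')  // "b      " (CHAR(7) storage)
assert(t1c != t2c)                 // the raw correlated comparison t2.c = t1.c fails
assert(t1c.padTo(7, ' ') == t2c)   // after right-padding, it matches as the query expects
{code}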



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34833) Apply right-padding correctly for correlated subqueries

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308685#comment-17308685
 ] 

Apache Spark commented on SPARK-34833:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31959

> Apply right-padding correctly for correlated subqueries
> ---
>
> Key: SPARK-34833
> URL: https://issues.apache.org/jira/browse/SPARK-34833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.1.2
>
>
> This ticket aims at fixing the bug where right-padding is not applied for char 
> types inside correlated subqueries.
> For example, the query below returns nothing in master, but the correct result 
> is `c`.
> {code}
> scala> sql(s"CREATE TABLE t1(v VARCHAR(3), c CHAR(5)) USING parquet")
> scala> sql(s"CREATE TABLE t2(v VARCHAR(5), c CHAR(7)) USING parquet")
> scala> sql("INSERT INTO t1 VALUES ('c', 'b')")
> scala> sql("INSERT INTO t2 VALUES ('a', 'b')")
> scala> val df = sql("""
>   |SELECT v FROM t1
>   |WHERE 'a' IN (SELECT v FROM t2 WHERE t2.c = t1.c )""".stripMargin)
> scala> df.show()
> +---+
> |  v|
> +---+
> +---+
> {code}
> This is because `ApplyCharTypePadding`  does not handle the case above to 
> apply right-padding into `'abc'`. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308698#comment-17308698
 ] 

Apache Spark commented on SPARK-34786:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31960

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.
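
A quick sanity check on the proposed mapping: the unsigned 64-bit range needs 20 decimal digits, so DECIMAL(20, 0) can hold every uint64 value.
{code:scala}
val maxUint64 = BigInt(2).pow(64) - 1   // 18446744073709551615
assert(maxUint64.toString.length == 20)
{code}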



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34786) read parquet uint64 as decimal

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34786:


Assignee: Apache Spark

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34786) read parquet uint64 as decimal

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34786:


Assignee: (was: Apache Spark)

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34526) Skip checking glob path in FileStreamSink.hasMetadata

2021-03-25 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-34526:

Summary: Skip checking glob path in FileStreamSink.hasMetadata  (was: Add a 
flag to skip checking file sink format and handle glob path)

> Skip checking glob path in FileStreamSink.hasMetadata
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket fixes the following issues related to file sink format checking 
> together:
>  * Some users may use a very long glob path to read and `isDirectory` may 
> fail when the path is too long. We should ignore the error when the path is a 
> glob path since the file streaming sink doesn’t support glob paths.
>  * Checking whether a directory is outputted by File Streaming Sink may fail 
> for various issues happening in the storage. We should add a flag to allow 
> users to disable the checking logic and read the directory as a batch output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34526) Skip checking glob path in FileStreamSink.hasMetadata

2021-03-25 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-34526:

Description: Some users may use a very long glob path to read and 
`isDirectory` may fail when the path is too long. We should ignore the error 
when the path is a glob path since the file streaming sink doesn’t support glob 
paths.  (was: This ticket fixes the following issues related to file sink 
format checking together:
 * Some users may use a very long glob path to read and `isDirectory` may fail 
when the path is too long. We should ignore the error when the path is a glob 
path since the file streaming sink doesn’t support glob paths.
 * Checking whether a directory is outputted by File Streaming Sink may fail 
for various issues happening in the storage. We should add a flag to allow 
users to disable the checking logic and read the directory as a batch output.)

> Skip checking glob path in FileStreamSink.hasMetadata
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Some users may use a very long glob path to read and `isDirectory` may fail 
> when the path is too long. We should ignore the error when the path is a glob 
> path since the file streaming sink doesn’t support glob paths.
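
A hedged illustration of the kind of read that trips the check (hypothetical bucket and layout); FileStreamSink metadata never applies to such patterns, so the isDirectory probe can be skipped for them.
{code:scala}
val df = spark.read.parquet("s3a://bucket/events/region=*/year=*/month=*/day=*/hour=*/")
{code}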



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34870) Jars downloaded with the --packages argument are not added to the classpath for executors.

2021-03-25 Thread Cory Maklin (Jira)
Cory Maklin created SPARK-34870:
---

 Summary: Jars downloaded with the --packages argument are not 
added to the classpath for executors.
 Key: SPARK-34870
 URL: https://issues.apache.org/jira/browse/SPARK-34870
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.0.0
 Environment: Spark worker running inside a Kubernetes pod with a 
Bitnami Spark image, and the driver running inside of a Jupyter Spark 
Kubernetes pod.
Reporter: Cory Maklin


When Spark is run in local mode, it works as expected. However, when Spark is 
run in client mode, it copies the jars to the executor ($SPARK_HOME/work//), but never adds them to the classpath.

It might be worth noting that `spark.jars` does add the jars to the classpath, 
but unlike `spark.jars.packages` it doesn't automatically download the jar's 
compiled dependencies.

 

{code:python}
spark = SparkSession.builder\
 .master(SPARK_MASTER)\
 .appName(APP_NAME)\
...
 .config("spark.jars.packages", DEPENDENCY_PACKAGES) \
...
 .getOrCreate()
{code}
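
A hedged diagnostic sketch for reproducing the report: check whether a class from the downloaded package is visible on the executors ("some.dependency.ClassName" is a placeholder, not a real class).
{code:scala}
val visibleOnExecutors = spark.sparkContext.parallelize(Seq(1)).map { _ =>
  try { Class.forName("some.dependency.ClassName"); true }
  catch { case _: ClassNotFoundException => false }
}.collect()
// Reportedly Array(false) with --packages in client mode, Array(true) with spark.jars.
{code}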

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34868:


Assignee: Apache Spark  (was: Max Gekk)

> Divide year-month interval by numeric
> -
>
> Key: SPARK-34868
> URL: https://issues.apache.org/jira/browse/SPARK-34868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the divide op of year-month interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType
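
A hedged example of what this should allow once it lands in 3.2; the interval literal
syntax is standard Spark SQL, but the exact result formatting and rounding are
assumptions rather than details from the ticket:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ym-interval-div").getOrCreate()

// 10 years 8 months is 128 months; dividing by 3 should give roughly 3 years 6 months.
spark.sql("SELECT INTERVAL '10-8' YEAR TO MONTH / 3 AS divided").show(false)
{code}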



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34868:


Assignee: Max Gekk  (was: Apache Spark)

> Divide year-month interval by numeric
> -
>
> Key: SPARK-34868
> URL: https://issues.apache.org/jira/browse/SPARK-34868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the divide op of year-month interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308774#comment-17308774
 ] 

Apache Spark commented on SPARK-34868:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31961

> Divide year-month interval by numeric
> -
>
> Key: SPARK-34868
> URL: https://issues.apache.org/jira/browse/SPARK-34868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the divide op of year-month interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34868) Divide year-month interval by numeric

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308775#comment-17308775
 ] 

Apache Spark commented on SPARK-34868:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31961

> Divide year-month interval by numeric
> -
>
> Key: SPARK-34868
> URL: https://issues.apache.org/jira/browse/SPARK-34868
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the divide op of year-month interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34870) Jars downloaded with the --packages argument are not added to the classpath for executors.

2021-03-25 Thread Cory Maklin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cory Maklin updated SPARK-34870:

Description: 
When Spark is run in local mode, it works as expected. However, when Spark is 
run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.

It might be worth noting that `spark.jars` does add the jars to the classpath, 
but unlike `spark.jars.packages` it doesn't automatically download the jar's 
compiled dependencies.

 

```

spark = SparkSession.builder\
 .master(SPARK_MASTER)\
 .appName(APP_NAME)\
 ...
 .config("spark.jars.packages", DEPENDENCY_PACKAGES) \

...
 .getOrCreate()
 ```

 

  was:
When Spark is run in local mode, it works as expected. However, when Spark is 
run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.

It might be worth noting that `spark.jars` does add the jars to the classpath, 
but unlike `spark.jars.packages` it doesn't automatically download the jar's 
compiled dependencies.

 

```

spark = SparkSession.builder\
 .master(SPARK_MASTER)\
 .appName(APP_NAME)\
...
 .config("spark.jars.packages", DEPENDENCY_PACKAGES) \

...
 .getOrCreate()
```

 


> Jars downloaded with the --packages argument are not added to the classpath 
> for executors.
> --
>
> Key: SPARK-34870
> URL: https://issues.apache.org/jira/browse/SPARK-34870
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0
> Environment: Spark worker running inside a Kubernetes pod with a 
> Bitnami Spark image, and the driver running inside of a Jupyter Spark 
> Kubernetes pod.
>Reporter: Cory Maklin
>Priority: Major
>
> When Spark is run in local mode, it works as expected. However, when Spark is 
> run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.
> It might be worth noting that `spark.jars` does add the jars to the 
> classpath, but unlike `spark.jars.packages` it doesn't automatically download 
> the jar's compiled dependencies.
>  
> ```
> spark = SparkSession.builder\
>  .master(SPARK_MASTER)\
>  .appName(APP_NAME)\
>  ...
>  .config("spark.jars.packages", DEPENDENCY_PACKAGES) \
> ...
>  .getOrCreate()
>  ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-34871:
---

 Summary: Move checkpoint resolving logic to the rule 
ResolveWriteToStream
 Key: SPARK-34871
 URL: https://issues.apache.org/jira/browse/SPARK-34871
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Yuanjian Li


After SPARK-34748, we have a rule ResolveWriteToStream that handles the analysis 
logic for resolving stream write plans. Based on it, we can further move the 
checkpoint location resolving work into the rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204, 
we can extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods 
output.


  was:
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204, 
we can extend the "EXTRA LOGS FOR THE FAILED TEST" section with the pod describe 
output.



> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Summary: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe 
pods output  (was: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod 
describe output)

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308782#comment-17308782
 ] 

Apache Spark commented on SPARK-34869:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/31962

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34869:


Assignee: Apache Spark  (was: Attila Zsolt Piros)

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34869:


Assignee: Attila Zsolt Piros  (was: Apache Spark)

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308783#comment-17308783
 ] 

Apache Spark commented on SPARK-34869:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/31962

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204, we can 
> extend the "EXTRA LOGS FOR THE FAILED TEST" section with the describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34871:


Assignee: (was: Apache Spark)

> Move checkpoint resolving logic to the rule ResolveWriteToStream
> 
>
> Key: SPARK-34871
> URL: https://issues.apache.org/jira/browse/SPARK-34871
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-34748, we have a rule ResolveWriteToStream that handles the analysis 
> logic for resolving stream write plans. Based on it, we can further move the 
> checkpoint location resolving work into the rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34871:


Assignee: Apache Spark

> Move checkpoint resolving logic to the rule ResolveWriteToStream
> 
>
> Key: SPARK-34871
> URL: https://issues.apache.org/jira/browse/SPARK-34871
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-34748, we have a rule ResolveWriteToStream that handles the analysis 
> logic for resolving stream write plans. Based on it, we can further move the 
> checkpoint location resolving work into the rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308787#comment-17308787
 ] 

Apache Spark commented on SPARK-34871:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/31963

> Move checkpoint resolving logic to the rule ResolveWriteToStream
> 
>
> Key: SPARK-34871
> URL: https://issues.apache.org/jira/browse/SPARK-34871
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-34748, we have a rule ResolveWriteToStream that handles the analysis 
> logic for resolving stream write plans. Based on it, we can further move the 
> checkpoint location resolving work into the rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33600) Group exception messages in execution/datasources/v2

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33600:
---

Assignee: Karen Feng

> Group exception messages in execution/datasources/v2
> 
>
> Key: SPARK-33600
> URL: https://issues.apache.org/jira/browse/SPARK-33600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Karen Feng
>Priority: Major
>
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2'
> || Filename ||   Count ||
> | AlterTableExec.scala |   1 |
> | CreateNamespaceExec.scala|   1 |
> | CreateTableExec.scala|   1 |
> | DataSourceRDD.scala  |   2 |
> | DataSourceV2Strategy.scala   |   9 |
> | DropNamespaceExec.scala  |   2 |
> | DropTableExec.scala  |   1 |
> | EmptyPartitionReader.scala   |   1 |
> | FileDataSourceV2.scala   |   1 |
> | FilePartitionReader.scala|   2 |
> | FilePartitionReaderFactory.scala |   1 |
> | ReplaceTableExec.scala   |   3 |
> | TableCapabilityCheck.scala   |   2 |
> | V1FallbackWriters.scala  |   1 |
> | V2SessionCatalog.scala   |  14 |
> | WriteToDataSourceV2Exec.scala|  10 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc'
> || Filename   ||   Count ||
> | JDBCTableCatalog.scala |   3 |
> '/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2'
> || Filename||   Count ||
> | DataSourceV2Implicits.scala |   3 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33600) Group exception messages in execution/datasources/v2

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33600.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31619
[https://github.com/apache/spark/pull/31619]

> Group exception messages in execution/datasources/v2
> 
>
> Key: SPARK-33600
> URL: https://issues.apache.org/jira/browse/SPARK-33600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Karen Feng
>Priority: Major
> Fix For: 3.2.0
>
>
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2'
> || Filename ||   Count ||
> | AlterTableExec.scala |   1 |
> | CreateNamespaceExec.scala|   1 |
> | CreateTableExec.scala|   1 |
> | DataSourceRDD.scala  |   2 |
> | DataSourceV2Strategy.scala   |   9 |
> | DropNamespaceExec.scala  |   2 |
> | DropTableExec.scala  |   1 |
> | EmptyPartitionReader.scala   |   1 |
> | FileDataSourceV2.scala   |   1 |
> | FilePartitionReader.scala|   2 |
> | FilePartitionReaderFactory.scala |   1 |
> | ReplaceTableExec.scala   |   3 |
> | TableCapabilityCheck.scala   |   2 |
> | V1FallbackWriters.scala  |   1 |
> | V2SessionCatalog.scala   |  14 |
> | WriteToDataSourceV2Exec.scala|  10 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc'
> || Filename   ||   Count ||
> | JDBCTableCatalog.scala |   3 |
> '/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2'
> || Filename||   Count ||
> | DataSourceV2Implicits.scala |   3 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-25 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34856.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31954
[https://github.com/apache/spark/pull/31954]

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, complex types are not allowed to be cast as string type. This breaks 
> the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array to string with ANSI mode on.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-34872:
--

 Summary: quoteIfNeeded should quote a name which contains non-word 
characters
 Key: SPARK-34872
 URL: https://issues.apache.org/jira/browse/SPARK-34872
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


quoteIfNeeded quotes a string only when it contains . (dots) or ` (backticks) 
but the method should quote it if it contains non-word characters.
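
A minimal sketch of the intended behaviour; the regex and the exact quoting rules are
assumptions based on the description, not the actual patch:

{code:scala}
// Quote unless the name consists entirely of word characters.
def quoteIfNeeded(part: String): String = {
  if (part.matches("[a-zA-Z0-9_]+")) part
  else "`" + part.replace("`", "``") + "`"
}

quoteIfNeeded("col_1") // col_1   (no quoting needed)
quoteIfNeeded("a.b")   // `a.b`   (already quoted today)
quoteIfNeeded("a-b")   // `a-b`   (quoted only after this change)
{code}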



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34872:
---
Description: quoteIfNeeded quotes a name only when it contains . (dots) or 
` (backticks) but the method should quote it if it contains non-word 
characters.  (was: quoteIfNeeded quotes a string only when it contains . (dots) 
or ` (backticks) but the method should quote it if it contains non-word 
characters.)

> quoteIfNeeded should quote a name which contains non-word characters
> 
>
> Key: SPARK-34872
> URL: https://issues.apache.org/jira/browse/SPARK-34872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> quoteIfNeeded quotes a name only when it contains . (dots) or ` (backticks) 
> but the method should quote it if it contains non-word characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34872:


Assignee: Apache Spark  (was: Kousuke Saruta)

> quoteIfNeeded should quote a name which contains non-word characters
> 
>
> Key: SPARK-34872
> URL: https://issues.apache.org/jira/browse/SPARK-34872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> quoteIfNeeded quotes a name only when it contains . (dots) or ` (backticks) 
> but the method should quote it if it contains non-word characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34872:


Assignee: Apache Spark  (was: Kousuke Saruta)

> quoteIfNeeded should quote a name which contains non-word characters
> 
>
> Key: SPARK-34872
> URL: https://issues.apache.org/jira/browse/SPARK-34872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> quoteIfNeeded quotes a name only when it contains . (dots) or ` (backticks) 
> but the method should quote it if it contains non-word characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308799#comment-17308799
 ] 

Apache Spark commented on SPARK-34872:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31964

> quoteIfNeeded should quote a name which contains non-word characters
> 
>
> Key: SPARK-34872
> URL: https://issues.apache.org/jira/browse/SPARK-34872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> quoteIfNeeded quotes a name only when it contains . (dots) or ` (backticks) 
> but the method should quote it if it contains non-word characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34872) quoteIfNeeded should quote a name which contains non-word characters

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34872:


Assignee: Kousuke Saruta  (was: Apache Spark)

> quoteIfNeeded should quote a name which contains non-word characters
> 
>
> Key: SPARK-34872
> URL: https://issues.apache.org/jira/browse/SPARK-34872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> quoteIfNeeded quotes a name only when it contains . (dots) or ` (backticks) 
> but the method should quote it if it contains non-word characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-25 Thread Michael Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308811#comment-17308811
 ] 

Michael Chen commented on SPARK-34780:
--

Np [~csun]. If the cache isn't materialized until after the configs change, 
then I believe the input RDDs for InMemoryTableScanExec are still built with 
the old stale confs so the stale confs would also be used in the 
DataSourceScanExec? Even if the cache was materialized before the configs 
changed, reading an RDD that was created with a stale conf would be a concern 
if any of the confs can change results right? (not sure if this is possible)

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of it's logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34870) Jars downloaded with the --packages argument are not added to the classpath for executors.

2021-03-25 Thread Cory Maklin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cory Maklin updated SPARK-34870:

Description: 
When Spark is run in local mode, it works as expected. However, when Spark is 
run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.

It might be worth noting that `spark.jars` does add the jars to the classpath, 
but unlike `spark.jars.packages` it doesn't automatically download the jar's 
compile dependencies.

 

```

spark = SparkSession.builder\
 .master(SPARK_MASTER)\
 .appName(APP_NAME)\
 ...
 .config("spark.jars.packages", DEPENDENCY_PACKAGES) \

...
 .getOrCreate()
 ```

 

  was:
When Spark is run in local mode, it works as expected. However, when Spark is 
run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.

It might be worth noting that `spark.jars` does add the jars to the classpath, 
but unlike `spark.jars.packages` it doesn't automatically download the jar's 
compiled dependencies.

 

```

spark = SparkSession.builder\
 .master(SPARK_MASTER)\
 .appName(APP_NAME)\
 ...
 .config("spark.jars.packages", DEPENDENCY_PACKAGES) \

...
 .getOrCreate()
 ```

 


> Jars downloaded with the --packages argument are not added to the classpath 
> for executors.
> --
>
> Key: SPARK-34870
> URL: https://issues.apache.org/jira/browse/SPARK-34870
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0
> Environment: Spark worker running inside a Kubernetes pod with a 
> Bitnami Spark image, and the driver running inside of a Jupyter Spark 
> Kubernetes pod.
>Reporter: Cory Maklin
>Priority: Major
>
> When Spark is run in local mode, it works as expected. However, when Spark is 
> run in client mode, it copies the jars to the executor ($SPARK_HOME/work/<app id>/), but never adds them to the classpath.
> It might be worth noting that `spark.jars` does add the jars to the 
> classpath, but unlike `spark.jars.packages` it doesn't automatically download 
> the jar's compile dependencies.
>  
> ```
> spark = SparkSession.builder\
>  .master(SPARK_MASTER)\
>  .appName(APP_NAME)\
>  ...
>  .config("spark.jars.packages", DEPENDENCY_PACKAGES) \
> ...
>  .getOrCreate()
>  ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-25 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308854#comment-17308854
 ] 

Chao Sun commented on SPARK-34780:
--

[~mikechen], yes you're right. I'm not sure if this is a big concern though, 
since it just means the plan fragment for the cache is executed with the stale 
conf. I guess as long as there is no correctness issue (which I'd be surprised 
to see if there's any), it should be fine?

It seems a bit tricky to fix the issue, since the {{SparkSession}} is leaked to 
many places. I guess one way is to follow the idea of SPARK-33389 and change 
{{SessionState}} to always use the active conf. 

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of it's logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34840) Fix cases of corruption in merged shuffle blocks that are pushed

2021-03-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-34840:
---

Assignee: Chandni Singh

> Fix cases of corruption in merged shuffle blocks that are pushed
> 
>
> Key: SPARK-34840
> URL: https://issues.apache.org/jira/browse/SPARK-34840
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> The {{RemoteBlockPushResolver}} which handles the shuffle push blocks and 
> merges them was introduced in 
> [#30062|https://github.com/apache/spark/pull/30062]. We have identified 2 
> scenarios where the merged blocks get corrupted:
>  # {{StreamCallback.onFailure()}} is called more than once. Initially we 
> assumed that the onFailure callback will be called just once per stream. 
> However, we observed that this is called twice when a client connection is 
> reset. When the client connection is reset then there are 2 events that get 
> triggered in this order.
>  * {{exceptionCaught}}. This event is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.exceptionCaught()}} invokes 
> {{callback.onFailure(streamId, cause)}}. This is the first time 
> StreamCallback.onFailure() will be invoked.
>  * {{channelInactive}}. Since the channel closes, the {{channelInactive}} 
> event gets triggered which again is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.channelInactive()}} invokes 
> {{callback.onFailure(streamId, new ClosedChannelException())}}. This is the 
> second time StreamCallback.onFailure() will be invoked.
>  # The flag {{isWriting}} is set prematurely to true. This introduces an edge 
> case where a stream that is trying to merge a duplicate block (created 
> because of a speculative task) may interfere with an active stream if the 
> duplicate stream fails.
> Also adding additional changes that improve the code.
>  # Using positional writes all the time because this simplifies the code and 
> with microbenchmarking haven't seen any performance impact.
>  # Additional minor changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34840) Fix cases of corruption in merged shuffle blocks that are pushed

2021-03-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-34840.
-
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31934
[https://github.com/apache/spark/pull/31934]

> Fix cases of corruption in merged shuffle blocks that are pushed
> 
>
> Key: SPARK-34840
> URL: https://issues.apache.org/jira/browse/SPARK-34840
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> The {{RemoteBlockPushResolver}} which handles the shuffle push blocks and 
> merges them was introduced in 
> [#30062|https://github.com/apache/spark/pull/30062]. We have identified 2 
> scenarios where the merged blocks get corrupted:
>  # {{StreamCallback.onFailure()}} is called more than once. Initially we 
> assumed that the onFailure callback will be called just once per stream. 
> However, we observed that this is called twice when a client connection is 
> reset. When the client connection is reset then there are 2 events that get 
> triggered in this order.
>  * {{exceptionCaught}}. This event is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.exceptionCaught()}} invokes 
> {{callback.onFailure(streamId, cause)}}. This is the first time 
> StreamCallback.onFailure() will be invoked.
>  * {{channelInactive}}. Since the channel closes, the {{channelInactive}} 
> event gets triggered which again is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.channelInactive()}} invokes 
> {{callback.onFailure(streamId, new ClosedChannelException())}}. This is the 
> second time StreamCallback.onFailure() will be invoked.
>  # The flag {{isWriting}} is set prematurely to true. This introduces an edge 
> case where a stream that is trying to merge a duplicate block (created 
> because of a speculative task) may interfere with an active stream if the 
> duplicate stream fails.
> Also adding additional changes that improve the code.
>  # Using positional writes all the time because this simplifies the code and 
> with microbenchmarking haven't seen any performance impact.
>  # Additional minor changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-03-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308872#comment-17308872
 ] 

Dongjoon Hyun commented on SPARK-34859:
---

Thank you for creating a new JIRA, [~lxian2]!

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, so some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` does not have such synchronization 
> among pages from different columns, so using `readNextFilteredRowGroup` may 
> produce incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Summary: JDBCRelation columnPartition function improperly determines stride 
size. Upper bound is skewed due to stride alignment.  (was: JDBCRelation 
columnPartition function improperly determines stride size)

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 
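
A quick check of the arithmetic in the description (bound and partition values are
copied from the ticket; the variable names are only for illustration):

{code:scala}
val lowerBound = -15611L  // 1927-04-05 as days since the epoch
val upperBound = 18563L   // 2020-10-27 as days since the epoch
val numPartitions = 1000

// Current formula: each division truncates separately.
val strideCurrent = upperBound / numPartitions - lowerBound / numPartitions
// 18 - (-15) = 33

// Proposed formula: divide in floating point, truncate once at the end.
val strideProposed =
  (upperBound / numPartitions.toFloat - lowerBound / numPartitions.toFloat).toLong
// (18.563 - (-15.611)).toLong = 34
{code}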



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Target Version/s: 3.2.0, 3.1.2, 3.0.3

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308978#comment-17308978
 ] 

Jason Yarbrough commented on SPARK-34843:
-

PR created: [Spark 34843 - JDBCRelation columnPartition function improperly 
determines stride size. Upper bound is skewed due to stride alignment. by 
hanover-fiste · Pull Request #31965 · apache/spark 
(github.com)|https://github.com/apache/spark/pull/31965]

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-25 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308984#comment-17308984
 ] 

Daniel Zhi commented on SPARK-23977:


[~ste...@apache.org] How can I work around the following exception during the execution 
of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

```
  if (dynamicPartitionOverwrite) {
    // until there's explicit extensions to the PathOutputCommitProtocols
    // to support the spark mechanism, it's left to the individual committer
    // choice to handle partitioning.
    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)
  }
```
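
One possible workaround sketch, purely an assumption and not something confirmed in
this thread: the guard above only fires when dynamic partition overwrite is requested,
so falling back to static overwrite mode avoids it, at the cost of static INSERT
OVERWRITE semantics.

{code:scala}
// Assumption: with static mode, dynamicPartitionOverwrite stays false and the
// PathOutputCommitProtocol check above is never reached. Assumes an active
// SparkSession named `spark`; verify the overwrite semantics are acceptable
// before relying on this.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
{code}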

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-25 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308984#comment-17308984
 ] 

Daniel Zhi edited comment on SPARK-23977 at 3/25/21, 9:40 PM:
--

[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols    

    // to support the spark mechanism, it's left to the individual committer    

    // choice to handle partitioning.    

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)  

}


was (Author: danzhi):
[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

```

  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols

    // to support the spark mechanism, it's left to the individual committer

    // choice to handle partitioning.

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)

  }

```

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at any stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where the job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-25 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308984#comment-17308984
 ] 

Daniel Zhi edited comment on SPARK-23977 at 3/25/21, 9:41 PM:
--

[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

 

  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols

    // to support the spark mechanism, it's left to the individual committer

    // choice to handle partitioning.

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)

  }


was (Author: danzhi):
[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols    

    // to support the spark mechanism, it's left to the individual committer    

    // choice to handle partitioning.    

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)  

}

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at any stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where the job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-25 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308984#comment-17308984
 ] 

Daniel Zhi edited comment on SPARK-23977 at 3/25/21, 9:42 PM:
--

[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?
{quote}  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols

    // to support the spark mechanism, it's left to the individual committer

    // choice to handle partitioning.

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)

  }
{quote}


was (Author: danzhi):
[~ste...@apache.org] How can we work around the following exception during the 
execution of "INSERT OVERWRITE" in spark.sql (Spark 3.1.1 with Hadoop 3.2)?

 

  if (dynamicPartitionOverwrite) {

    // until there's explicit extensions to the PathOutputCommitProtocols

    // to support the spark mechanism, it's left to the individual committer

    // choice to handle partitioning.

    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)

  }

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at any stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where the job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2021-03-25 Thread Chenxi Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309012#comment-17309012
 ] 

Chenxi Zhao commented on SPARK-31754:
-

I got the same issue. I was using Spark 2.4.4 and doing a left outer join from a 
Kafka source, loading about 288 GB of data. After the joining state begins, I 
immediately start to see this exception:

 

21/03/24 04:56:51 ERROR Utils: Aborting task
java.lang.NullPointerException
 at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.And_0$(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:217)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
 at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
 at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:409)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:415)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing we are getting 
> NullPointer Exception after running for few minutes. 
> After failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files are not having any null values 
> in the join columns. 
> We even started the job with the files and the application ran. From this we 
> concluded that the exception is not because of the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not

[jira] [Assigned] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34843:


Assignee: Apache Spark

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Assignee: Apache Spark
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 
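To make the arithmetic above easy to check, here is a small self-contained sketch (plain Scala, no Spark required) that evaluates both formulas for the example bounds; the object and variable names are illustrative only.

{code:scala}
object StrideExample extends App {
  // Example bounds from the description above (dates translated to day numbers).
  val lowerBound = -15611L
  val upperBound = 18563L
  val numPartitions = 1000

  // Current formula: integer division truncates twice.
  val strideCurrent: Long = upperBound / numPartitions - lowerBound / numPartitions

  // Proposed formula: divide in floating point, truncate once at the end.
  val strideProposed: Long =
    (upperBound / numPartitions.toFloat - lowerBound / numPartitions.toFloat).toLong

  println(s"current = $strideCurrent, proposed = $strideProposed")  // current = 33, proposed = 34
}
{code}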



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34843:


Assignee: (was: Apache Spark)

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2021-03-25 Thread Chenxi Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309012#comment-17309012
 ] 

Chenxi Zhao edited comment on SPARK-31754 at 3/26/21, 12:08 AM:


I got the same issue. I was using Spark 2.4.4 and doing a left outer join from a 
Kafka source, loading about 288 GB of data. After the joining state begins, I 
immediately start to see this exception:

 

21/03/24 04:56:51 ERROR Utils: Aborting task
 java.lang.NullPointerException
 at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.And_0$(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:217)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
 at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
 at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:409)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:415)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

 

Edit: It might be related to the corrupted checkpoint. I deleted the checkpoint 
folder and this particular exception is gone.


was (Author: homezcx):
I got the same issue. I was using Spark 2.4.4 and doing a left outer join from a 
Kafka source, loading about 288 GB of data. After the joining state begins, I 
immediately start to see this exception:

 

21/03/24 04:56:51 ERROR Utils: Aborting task
java.lang.NullPointerException
 at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.And_0$(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at 
org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$10.apply(StreamingSymmetricHashJoinExec.scala:228)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:217)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
 at 
org.apache.spark.sql.execution.datasources.v2.DataWritingS

[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34843:
-
Target Version/s:   (was: 3.2.0, 3.1.2, 3.0.3)

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-03-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34843:
-
Affects Version/s: 3.0.3
   3.2.0
   3.1.2

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> --
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.2.0, 3.0.3, 3.1.2
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34871.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31963
[https://github.com/apache/spark/pull/31963]

> Move checkpoint resolving logic to the rule ResolveWriteToStream
> 
>
> Key: SPARK-34871
> URL: https://issues.apache.org/jira/browse/SPARK-34871
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.2.0
>
>
> After SPARK-34748, we have a rule, ResolveWriteToStream, that holds the analysis 
> logic for resolving stream write plans. Based on it, we can further 
> move the checkpoint location resolving work into that rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34871) Move checkpoint resolving logic to the rule ResolveWriteToStream

2021-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34871:


Assignee: Yuanjian Li

> Move checkpoint resolving logic to the rule ResolveWriteToStream
> 
>
> Key: SPARK-34871
> URL: https://issues.apache.org/jira/browse/SPARK-34871
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> After SPARK-34748, we have a rule, ResolveWriteToStream, that holds the analysis 
> logic for resolving stream write plans. Based on it, we can further 
> move the checkpoint location resolving work into that rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34786) read parquet uint64 as decimal

2021-03-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-34786:


Assignee: Kent Yao

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Kent Yao
>Priority: Major
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.
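Not the reader change itself, just a minimal sketch of the conversion idea behind it: a uint64 arrives as a raw signed 64-bit value, and reinterpreting it as unsigned needs a wider type such as Decimal(20, 0). The helper name is hypothetical.

{code:scala}
// Reinterpret a raw 64-bit value (already read as a signed Long) as unsigned.
// Raw values with the sign bit set correspond to 2^64 + raw, which only fits
// in a 20-digit decimal.
def uint64AsBigDecimal(raw: Long): BigDecimal =
  BigDecimal(BigInt(java.lang.Long.toUnsignedString(raw)))

uint64AsBigDecimal(-1L)            // 18446744073709551615, the uint64 maximum
uint64AsBigDecimal(Long.MaxValue)  // 9223372036854775807
{code}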



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34786) read parquet uint64 as decimal

2021-03-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34786.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31960
[https://github.com/apache/spark/pull/31960]

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34849) SPIP: Support pandas API layer on PySpark

2021-03-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34849:

Description: 
This is a SPIP for porting [Koalas 
project|https://github.com/databricks/koalas] to PySpark, which was once 
discussed on the dev mailing list under the same title, [[DISCUSS] Support 
pandas API layer on 
PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
 

*Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.*

 Porting Koalas into PySpark to support the pandas API layer on PySpark for:
 - Users can easily leverage their existing Spark cluster to scale their pandas 
workloads.
 - Support plot and drawing a chart in PySpark
 - Users can easily switch between pandas APIs and PySpark APIs


*Q2. What problem is this proposal NOT designed to solve?*

Some APIs of pandas are explicitly unsupported. For example, {{memory_usage}} 
in pandas will not be supported because DataFrames are not materialized in 
memory in Spark unlike pandas.

This does not replace the existing PySpark APIs. PySpark API has lots of users 
and existing code in many projects, and there are still many PySpark users who 
prefer Spark’s immutable DataFrame API to the pandas API.


*Q3. How is it done today, and what are the limits of current practice?*

The current practice has 2 limits as below.
 # There are many features missing in Apache Spark that are very commonly used 
in data science. Specifically, plotting and drawing a chart is missing, which is 
one of the most important features that almost every data scientist uses in 
their daily work.
 # Data scientists tend to prefer pandas APIs, but it is very hard to change 
them into PySpark APIs when they need to scale their workloads. This is because 
PySpark APIs are difficult to learn compared to pandas' and there are many 
missing features in PySpark.


*Q4. What is new in your approach and why do you think it will be successful?*

I believe this suggests a new way for both PySpark and pandas users to easily 
scale their workloads. I think we can be successful because more and more 
people tend to use Python and pandas. In fact, there are already similar attempts, 
such as Dask and Modin, which are all growing fast and successfully.


*Q5. Who cares? If you are successful, what difference will it make?*

Anyone who wants to scale their pandas workloads on their Spark cluster. It 
will also significantly improve the usability of PySpark.


*Q6. What are the risks?*

Technically I don't see many risks yet given that:
- Koalas has grown separately for more than two years, and has greatly improved 
maturity and stability.
- Koalas will be ported into PySpark as a separate package

It is more about putting documentation and test cases in place properly and 
properly handling dependencies. For example, Koalas currently uses pytest with 
various dependencies whereas PySpark uses the plain unittest with fewer 
dependencies.

In addition, Koalas' default indexing system has not been much loved, because it 
can potentially cause overhead, so applying it properly to PySpark might be a 
challenge.


*Q7. How long will it take?*

Before the Spark 3.2 release.


*Q8. What are the mid-term and final “exams” to check for success?*

The first check for success would be to make sure that all the existing Koalas 
APIs and tests work as they are, without affecting the existing Koalas 
workloads on PySpark.

The last thing to confirm is to check whether the usability and convenience 
that we aim for is actually increased through user feedback and PySpark usage 
statistics.


  was:
This is a SPIP for porting [Koalas 
project|https://github.com/databricks/koalas] to PySpark, which was once 
discussed on the dev mailing list under the same title, [[DISCUSS] Support 
pandas API layer on 
PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
 

*Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.*

 # Porting Koalas into PySpark to support the pandas API layer on PySpark for:
 - Users can easily leverage their existing Spark cluster to scale their pandas 
workloads.
 - Support plot and drawing a chart in PySpark
 - Users can easily switch between pandas APIs and PySpark APIs


*Q2. What problem is this proposal NOT designed to solve?*

Some APIs of pandas are explicitly unsupported. For example, {{memory_usage}} 
in pandas will not be supported because DataFrames are not materialized in 
memory in Spark unlike pandas.

This does not replace the existing PySpark APIs. PySpark API has lots of users 
and existing code in many projects, and there are still many PySpark users who 
prefer Spark’s immutable DataFrame API to the pandas API.


*Q3. How is it done today, and what are the limits of curren

[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309087#comment-17309087
 ] 

Apache Spark commented on SPARK-34638:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31966

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] 
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:long>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34638:


Assignee: Apache Spark

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Jiri Humpolicek
>Assignee: Apache Spark
>Priority: Major
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] 
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:long>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34638:


Assignee: (was: Apache Spark)

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] 
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:long>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309088#comment-17309088
 ] 

Apache Spark commented on SPARK-34638:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31966

> Spark SQL reads unnecessary nested fields (another type of pruning case)
> 
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Jiri Humpolicek
>Priority: Major
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] 
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:long>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34607) NewInstance.resolved should not throw malformed class name error

2021-03-25 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34607:

Fix Version/s: 2.4.8

> NewInstance.resolved should not throw malformed class name error
> 
>
> Key: SPARK-34607
> URL: https://issues.apache.org/jira/browse/SPARK-34607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kris Mok
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>
>
> I'd like to seek community help on fixing this issue:
> Related to SPARK-34596, one of our users had hit an issue with 
> {{ExpressionEncoder}} when running Spark code in a Scala REPL, where 
> {{NewInstance.resolved}} was throwing {{"Malformed class name"}} error, with 
> the following kind of stack trace:
> {code}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleBinaryName(Class.java:1450)
>   at java.lang.Class.isMemberClass(Class.java:1433)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
>   ...
>  Caused by: sbt.ForkMain$ForkError: 
> java.lang.StringIndexOutOfBoundsException: String index out of range: -83
>   at java.lang.String.substring(String.java:1931)
>   at java.lang.Class.getSimpleBinaryName(Class.java:1448)
>   at java.lang.Class.isMemberClass(Class.java:1433)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
>   ...
> {code}
> The most important point in the stack trace is this:
> {code}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleBinaryName(Class.java:1450)
>   at java.lang.Class.isMemberClass(Class.java:1433)
> {code}
> The most common way to hit the {{"Malformed class name"}} issue in Spark is 
> via {{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, 
> it can happen via other code paths from the JDK as well.
> If we want to fix it in a similar fashion as {{Utils.getSimpleName}}, we'd 
> have to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and 
> then use it in the {{NewInstance.resolved}} code path.
> Here's a reproducer test case (in diff form against Spark master's 
> {{ExpressionEncoderSuite}} ), which uses explicit nested classes to emulate 
> the code structure that'd be generated by Scala's REPL:
> (the level of nesting can be further reduced and still reproduce the issue)
> {code}
> diff --git 
> a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
>  
> b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
> index 2635264..fd1b23d 100644
> --- 
> a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
> +++ 
> b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
> @@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends 
> CodegenInterpretedPlanTest with AnalysisTes
>"nested Scala class should work")
>}
>  
> +  object OuterLevelWithVeryVeryVeryLongClassName1 {
> +object OuterLevelWithVeryVeryVeryLongClassName2 {
> +  object OuterLevelWithVeryVeryVeryLongClassName3 {
> +object OuterLevelWithVeryVeryVeryLongClassName4 {
> +  object OuterLevelWithVeryVeryVeryLongClassName5 {
> +object OuterLevelWithVeryVeryVeryLongClassName6 {
> +  object OuterLevelWithVeryVeryVeryLongClassName7 {
> +object OuterLevelWithVeryVeryVeryLongClassName8 {
> +  object OuterLevelWithVeryVeryVeryLongClassName9 {
> +object OuterLevelWithVeryVeryVeryLongClassName10 {
> +  object OuterLevelWithVeryVeryVeryLongClassName11 {
> +object OuterLevelWithVeryVeryVeryLongClassName12 {
> +  object OuterLevelWithVeryVeryVeryLongClassName13 {
> +object OuterLevelWithVeryVeryVeryLongClassName14 
> {
> +  object 
> OuterLevelWithVeryVeryVeryLongClassName15 {
> +object 
> OuterLevelWithVeryVeryVeryLongClassName16 {
> +  object 
> OuterLevelWithVeryVeryVeryLongClassName17 {
> +object 
> OuterLevelWithVeryV
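For the fallback idea mentioned above (deriving a safe display name instead of relying on the JDK call that throws), a minimal sketch, purely illustrative and not Spark's actual Utils.getSimpleName:

{code:scala}
// Sketch: derive a short display name from Class.getName, falling back when the
// JDK's getSimpleName throws InternalError on deeply nested REPL-defined classes.
def safeSimpleName(cls: Class[_]): String =
  try cls.getSimpleName
  catch {
    case _: InternalError =>
      val full = cls.getName
      val afterPackage = full.substring(full.lastIndexOf('.') + 1)
      // Keep the last non-empty segment between '$' separators.
      afterPackage.split('$').filter(_.nonEmpty).lastOption.getOrElse(afterPackage)
  }
{code}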

[jira] [Updated] (SPARK-34542) Upgrade Parquet to 1.12.0

2021-03-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34542:

Description: 
Parquet-1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

  was:
Parquet-1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0-rc2/CHANGES.md


> Upgrade Parquet to 1.12.0
> -
>
> Key: SPARK-34542
> URL: https://issues.apache.org/jira/browse/SPARK-34542
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet-1.12.0 release notes:
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32855) Improve DPP for some join type do not support broadcast filtering side

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32855.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 29726
[https://github.com/apache/spark/pull/29726]

> Improve DPP for some join type do not support broadcast filtering side
> --
>
> Key: SPARK-32855
> URL: https://issues.apache.org/jira/browse/SPARK-32855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> For cases where the filtering side cannot be broadcast because of the join type but 
> could be broadcast based on its size,
> we should not only consider reusing an existing broadcast exchange. For example:
> a left outer join where the left side is very small.
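A hedged illustration of the shape of query this targets (table and column names are made up): the small left side carries the selective filter, but a left outer join cannot use its left side as the broadcast build side, so size-based pruning has to be considered separately from broadcast reuse.

{code:scala}
// 'facts' is assumed to be a large table partitioned by store_id;
// 'stores' is assumed to be tiny and carries the selective filter.
val pruned = spark.sql("""
  SELECT s.store_id, f.revenue
  FROM stores s
  LEFT OUTER JOIN facts f
    ON s.store_id = f.store_id
  WHERE s.region = 'EU'
""")
// With this improvement, the plan should be able to carry a dynamic partition
// pruning filter on facts.store_id even though the broadcast of 'stores' cannot
// be reused as the join's build side.
pruned.explain(true)
{code}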



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32855) Improve DPP for some join type do not support broadcast filtering side

2021-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32855:
---

Assignee: Yuming Wang

> Improve DPP for some join type do not support broadcast filtering side
> --
>
> Key: SPARK-32855
> URL: https://issues.apache.org/jira/browse/SPARK-32855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> For cases where the filtering side cannot be broadcast because of the join type but 
> could be broadcast based on its size,
> we should not only consider reusing an existing broadcast exchange. For example:
> a left outer join where the left side is very small.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34816) Support for Parquet unsigned LogicalTypes

2021-03-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-34816:


Assignee: Kent Yao

> Support for Parquet unsigned LogicalTypes
> -
>
> Key: SPARK-34816
> URL: https://issues.apache.org/jira/browse/SPARK-34816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Parquet supports some unsigned datatypes.  Here is the definition related in 
> parquet.thrift
> {code:java}
> /**
>  * Common types used by frameworks(e.g. hive, pig) using parquet.  This helps 
> map
>  * between types in those frameworks to the base types in parquet.  This is 
> only
>  * metadata and not needed to read or write the data.
>  */
>   /**
>* An unsigned integer value.
>*
>* The number describes the maximum number of meaningful data bits in
>* the stored value. 8, 16 and 32 bit values are stored using the
>* INT32 physical type.  64 bit values are stored using the INT64
>* physical type.
>*
>*/
>   UINT_8 = 11;
>   UINT_16 = 12;
>   UINT_32 = 13;
>   UINT_64 = 14;
> {code}
> Spark does not support unsigned datatypes. In SPARK-10113, we emit an 
> exception with a clear message for them. 
> UInt8-[0:255]
> UInt16-[0:65535]
> UInt32-[0:4294967295]
> UInt64-[0:18446744073709551615]
> Unsigned types - may be used to produce smaller in-memory representations of 
> the data. If the stored value is larger than the maximum allowed by int32 or 
> int64, then the behavior is undefined.
> In this ticket, we try to read them as a higher precision signed type
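One natural widening to signed Spark types, written here as an assumption for illustration rather than as the exact mapping the change adopts:

{code:scala}
import org.apache.spark.sql.types._

// UINT_8 / UINT_16 fit in a signed 32-bit int; UINT_32 needs 64 bits;
// UINT_64 needs about 20 decimal digits.
def widenUnsigned(bitWidth: Int): DataType = bitWidth match {
  case 8 | 16 => IntegerType
  case 32     => LongType
  case 64     => DecimalType(20, 0)
  case other  => throw new IllegalArgumentException(s"unexpected unsigned width: $other")
}
{code}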



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34816) Support for Parquet unsigned LogicalTypes

2021-03-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34816.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

> Support for Parquet unsigned LogicalTypes
> -
>
> Key: SPARK-34816
> URL: https://issues.apache.org/jira/browse/SPARK-34816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet supports some unsigned datatypes.  Here is the definition related in 
> parquet.thrift
> {code:java}
> /**
>  * Common types used by frameworks(e.g. hive, pig) using parquet.  This helps 
> map
>  * between types in those frameworks to the base types in parquet.  This is 
> only
>  * metadata and not needed to read or write the data.
>  */
>   /**
>* An unsigned integer value.
>*
>* The number describes the maximum number of meaningful data bits in
>* the stored value. 8, 16 and 32 bit values are stored using the
>* INT32 physical type.  64 bit values are stored using the INT64
>* physical type.
>*
>*/
>   UINT_8 = 11;
>   UINT_16 = 12;
>   UINT_32 = 13;
>   UINT_64 = 14;
> {code}
> Spark does not support unsigned datatypes. In SPARK-10113, we emit an 
> exception with a clear message for them. 
> UInt8-[0:255]
> UInt16-[0:65535]
> UInt32-[0:4294967295]
> UInt64-[0:18446744073709551615]
> Unsigned types - may be used to produce smaller in-memory representations of 
> the data. If the stored value is larger than the maximum allowed by int32 or 
> int64, then the behavior is undefined.
> In this ticket, we try to read them as a higher precision signed type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


