[jira] [Updated] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions

2021-09-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35959:
-
Priority: Major  (was: Blocker)

> Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions 
> -
>
> Key: SPARK-35959
> URL: https://issues.apache.org/jira/browse/SPARK-35959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark uses the Hadoop shaded client by default. However, if Spark users 
> want to build Spark with an older version of Hadoop, such as 3.1.x, the shaded 
> client cannot be used (currently it only supports Hadoop 3.2.2+ and 3.3.1+). 
> Therefore, this proposes to offer a new Maven profile "no-shaded-client" for 
> this use case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412167#comment-17412167
 ] 

Chao Sun commented on SPARK-36696:
--

[This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331]
 looks suspicious: why is the column chunk file offset computed as the 
dictionary/data page offset plus the compressed size of the column chunk?

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by the Parquet 1.12.0 upgrade.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164
 ] 

Chao Sun commented on SPARK-36696:
--

This looks like the same issue as PARQUET-2078. The file offset for the 
first row group is set to 31173, which causes {{filterFileMetaDataByMidpoint}} 
to filter out the only row group (the range filter is [0, 37968], while the 
startIndex is 31173 and the total size is 35820).

It seems there is a bug in Apache Arrow that writes an incorrect file offset. cc 
[~gershinsky] to see if you know anything about this.
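
For reference, plugging the numbers above into the midpoint check (a rough sketch of 
the parquet-mr logic, not a copy of it) shows why the only row group gets dropped:

{code:java}
// Hedged sketch of the midpoint check with the values from this file (not a verbatim
// copy of parquet-mr's filterFileMetaDataByMidpoint).
public class MidpointCheck {
  public static void main(String[] args) {
    long startIndex = 31173L;                 // file offset recorded for the row group
    long totalSize  = 35820L;                 // total compressed size of the row group
    long rangeStart = 0L, rangeEnd = 37968L;  // range requested for the Spark split

    long midpoint = startIndex + totalSize / 2;  // 31173 + 17910 = 49083
    boolean kept = midpoint >= rangeStart && midpoint <= rangeEnd;
    System.out.println(kept);  // false -> the only row group is dropped, count() = 0
  }
}
{code}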

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by the Parquet 1.12.0 upgrade.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36695) Allow passing V2 functions to data sources via V2 filters

2021-09-08 Thread Chao Sun (Jira)
Chao Sun created SPARK-36695:


 Summary: Allow passing V2 functions to data sources via V2 filters
 Key: SPARK-36695
 URL: https://issues.apache.org/jira/browse/SPARK-36695
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


The V2 filter API currently only allows {{NamedReference}} in predicates that 
are pushed down to data sources. It may be beneficial to allow V2 functions in 
predicates as well so that we can implement function pushdown. This feature is 
also supported by Trino (Presto).

One use case is that we can push down predicates such as {{bucket(col, 32) = 10}}, 
which would allow data sources such as Iceberg to scan only a single partition.
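
Purely as an illustration of the shape of such a predicate (the classes below are 
made up for the sketch and are not Spark's connector expression API):

{code:java}
// Illustrative only: the predicate we would like data sources to receive, bucket(col, 32) = 10.
public class V2FunctionPredicateSketch {
  interface Expr {}
  static final class ColumnRef implements Expr { final String name; ColumnRef(String n) { name = n; } }
  static final class Literal implements Expr { final Object value; Literal(Object v) { value = v; } }
  static final class FunctionCall implements Expr {
    final String name; final Expr[] args;
    FunctionCall(String name, Expr... args) { this.name = name; this.args = args; }
  }
  static final class EqualTo implements Expr {
    final Expr left, right;
    EqualTo(Expr l, Expr r) { left = l; right = r; }
  }

  public static void main(String[] args) {
    // A source like Iceberg could map the function call back to its bucket partition
    // transform and prune the scan to a single partition.
    Expr pushed = new EqualTo(
        new FunctionCall("bucket", new ColumnRef("col"), new Literal(32)),
        new Literal(10));
    System.out.println(pushed);
  }
}
{code}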



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-06 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410726#comment-17410726
 ] 

Chao Sun commented on SPARK-36676:
--

Will post a PR soon

> Create shaded Hive module and upgrade to higher version of Guava
> 
>
> Key: SPARK-36676
> URL: https://issues.apache.org/jira/browse/SPARK-36676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark is tied to the Guava version pulled in from Hive, which is version 14. 
> This proposes to create a separate module {{hive-shaded}} which shades 
> dependencies from Hive and subsequently allows us to upgrade Guava 
> independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-06 Thread Chao Sun (Jira)
Chao Sun created SPARK-36676:


 Summary: Create shaded Hive module and upgrade to higher version 
of Guava
 Key: SPARK-36676
 URL: https://issues.apache.org/jira/browse/SPARK-36676
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently Spark is tied to the Guava version pulled in from Hive, which is version 14. 
This proposes to create a separate module {{hive-shaded}} which shades dependencies 
from Hive and subsequently allows us to upgrade Guava independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-01 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408324#comment-17408324
 ] 

Chao Sun commented on SPARK-34276:
--

I did some study of the code and it seems this will only affect Spark when 
{{spark.sql.hive.convertMetastoreParquet}} is set to false, as [~nemon] pointed 
out above. By default Spark uses {{filterFileMetaDataByMidpoint}} (see 
[here|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226]),
 which is not impacted much by this bug. In the worst case it could cause 
imbalance when assigning Parquet row groups to Spark tasks, but nothing like a 
read error or data loss.

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-31 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407554#comment-17407554
 ] 

Chao Sun commented on SPARK-34276:
--

[~smilegator] yeah, it seems like Spark will be affected. cc [~gszadovszky] to 
confirm. [~nemon], is the issue you mentioned the same as PARQUET-2078? 

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader

2021-08-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36528:
-
Description: Currently Spark first decodes the data (e.g., RLE/bit-packed, PLAIN) 
into column vectors and then operates on the decoded data. However, it may be 
more efficient to operate directly on the encoded data, for instance, performing a 
filter or aggregation on RLE-encoded data, or performing comparisons over 
dictionary-encoded string data. This can also potentially work with encodings 
in the Parquet v2 format, such as DELTA_BYTE_ARRAY.  (was: Currently Spark first 
decode (e.g., RLE/bit-packed, PLAIN) into column vector and then operate on the 
decoded data. However, it may be more efficient to directly operate on encoded 
data (e.g., when the data is using RLE encoding). This can also potentially 
work with encodings in Parquet v2 format, such as DELTA_BYTE_ARRAY.)
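
As a hedged illustration of the dictionary-encoded case (the class below is made up 
for the sketch and is not Spark's API): an equality predicate can be resolved once 
against the dictionary and then evaluated on the integer dictionary ids, without 
materializing any strings.

{code:java}
// Sketch: evaluate `col = target` on dictionary-encoded data without decoding strings.
public class DictionaryFilterSketch {
  static boolean[] equalTo(String[] dictionary, int[] dictIds, String target) {
    // Resolve the predicate once against the (small) dictionary.
    int targetId = -1;
    for (int i = 0; i < dictionary.length; i++) {
      if (dictionary[i].equals(target)) { targetId = i; break; }
    }
    // Compare integer ids instead of decoded string values.
    boolean[] result = new boolean[dictIds.length];
    if (targetId >= 0) {
      for (int i = 0; i < dictIds.length; i++) {
        result[i] = (dictIds[i] == targetId);
      }
    }
    return result;
  }
}
{code}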

> Implement lazy decoding for the vectorized Parquet reader
> -
>
> Key: SPARK-36528
> URL: https://issues.apache.org/jira/browse/SPARK-36528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark first decodes the data (e.g., RLE/bit-packed, PLAIN) into column vectors 
> and then operates on the decoded data. However, it may be more efficient to 
> operate directly on the encoded data, for instance, performing a filter or 
> aggregation on RLE-encoded data, or performing comparisons over 
> dictionary-encoded string data. This can also potentially work with encodings 
> in the Parquet v2 format, such as DELTA_BYTE_ARRAY.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader

2021-08-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36528:
-
Description: Currently Spark first decodes the data (e.g., RLE/bit-packed, PLAIN) 
into column vectors and then operates on the decoded data. However, it may be 
more efficient to operate directly on the encoded data (e.g., when the data 
uses RLE encoding). This can also potentially work with encodings in the Parquet 
v2 format, such as DELTA_BYTE_ARRAY.  (was: Currently Spark first decode (e.g., 
RLE/bit-packed, PLAIN) into column vector and then operate on the decoded data. 
However, it may be more efficient to directly operate on encoded data (e.g., 
when the data is using RLE encoding).)

> Implement lazy decoding for the vectorized Parquet reader
> -
>
> Key: SPARK-36528
> URL: https://issues.apache.org/jira/browse/SPARK-36528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark first decodes the data (e.g., RLE/bit-packed, PLAIN) into column vectors 
> and then operates on the decoded data. However, it may be more efficient to 
> operate directly on the encoded data (e.g., when the data uses RLE encoding). 
> This can also potentially work with encodings in the Parquet v2 format, such as 
> DELTA_BYTE_ARRAY.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader

2021-08-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36527:
-
Description: At the moment the Parquet vectorized reader will eagerly 
decode all the columns that are in the read schema, before any filter has been 
applied to them. This is costly. Instead it's better to only materialize these 
column vectors when the data are actually needed.  (was: At the moment the 
Parquet vectorized reader will eagerly decode all the columns that are in the 
read schema, before any filter has been applied to them. This is costly. 
Instead it's better to only materialize these column vectors when the data are 
actually read.)

> Implement lazy materialization for the vectorized Parquet reader
> 
>
> Key: SPARK-36527
> URL: https://issues.apache.org/jira/browse/SPARK-36527
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> At the moment the Parquet vectorized reader will eagerly decode all the 
> columns that are in the read schema, before any filter has been applied to 
> them. This is costly. Instead it's better to only materialize these column 
> vectors when the data are actually needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

2021-08-16 Thread Chao Sun (Jira)
Chao Sun created SPARK-36529:


 Summary: Decouple CPU with IO work in vectorized Parquet reader
 Key: SPARK-36529
 URL: https://issues.apache.org/jira/browse/SPARK-36529
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently it seems the vectorized Parquet reader does almost everything in a 
sequential manner:
1. read the row group using the file system API (perhaps from remote storage like 
S3)
2. allocate buffers and store those row group bytes into them
3. decompress the data pages
4. in Spark, decode all the read columns one by one
5. read the next row group and repeat from 1.

A lot of improvements can be made to decouple the IO-intensive and CPU-intensive 
work. In addition, we could parallelize the row group loading and column decoding, 
and utilize all the cores available for a Spark task.
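
As a rough, hedged sketch of the decoupling idea (not Spark's implementation; the 
Loader interface below is made up): the next row group can be loaded and 
decompressed on an I/O thread while the current one is being decoded on the task's 
CPU.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: overlap row-group I/O + decompression with CPU-bound column decoding.
public final class PrefetchingReader<T> implements AutoCloseable {
  public interface Loader<T> { T loadNextRowGroup() throws Exception; }  // steps 1-3 above

  private final ExecutorService io = Executors.newSingleThreadExecutor();
  private final Loader<T> loader;
  private Future<T> next;

  public PrefetchingReader(Loader<T> loader) {
    this.loader = loader;
    this.next = io.submit(loader::loadNextRowGroup);  // start prefetching immediately
  }

  /** Returns the already-loaded row group and kicks off loading of the following one. */
  public T nextRowGroup() throws Exception {
    T current = next.get();                      // blocks only if I/O is slower than decoding
    next = io.submit(loader::loadNextRowGroup);  // prefetch while the caller decodes 'current'
    return current;
  }

  @Override public void close() { io.shutdownNow(); }
}
{code}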




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader

2021-08-16 Thread Chao Sun (Jira)
Chao Sun created SPARK-36528:


 Summary: Implement lazy decoding for the vectorized Parquet reader
 Key: SPARK-36528
 URL: https://issues.apache.org/jira/browse/SPARK-36528
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently Spark first decodes the data (e.g., RLE/bit-packed, PLAIN) into column 
vectors and then operates on the decoded data. However, it may be more efficient to 
operate directly on the encoded data (e.g., when the data uses RLE encoding).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader

2021-08-16 Thread Chao Sun (Jira)
Chao Sun created SPARK-36527:


 Summary: Implement lazy materialization for the vectorized Parquet 
reader
 Key: SPARK-36527
 URL: https://issues.apache.org/jira/browse/SPARK-36527
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


At the moment the Parquet vectorized reader will eagerly decode all the columns 
that are in the read schema, before any filter has been applied to them. This 
is costly. Instead it's better to only materialize these column vectors when 
the data are actually read.
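
A minimal, hedged sketch of the idea (the class below is not one of Spark's column 
vector classes): keep the page decoding behind a supplier and run it only when a 
value of the column is actually requested, e.g. when at least one row survived the 
filters evaluated on the other columns.

{code:java}
import java.util.function.Supplier;

// Sketch: materialize (decode) the column vector lazily, on first access.
public final class LazyIntVector {
  private final Supplier<int[]> decode;  // decodes the Parquet pages into an int[]
  private int[] values;                  // stays null until someone reads the column

  public LazyIntVector(Supplier<int[]> decode) { this.decode = decode; }

  public int getInt(int rowId) {
    if (values == null) {   // pay the decoding cost only when the data is needed
      values = decode.get();
    }
    return values[rowId];
  }
}
{code}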



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13

2021-08-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-36511:


 Summary: Remove ColumnIO once PARQUET-2050 is released in Parquet 
1.13
 Key: SPARK-36511
 URL: https://issues.apache.org/jira/browse/SPARK-36511
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


{{ColumnIO}} doesn't expose methods to get the repetition and definition levels, so 
Spark has to use a hack. The hack should be removed once PARQUET-2050 is released.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-34861) Support nested column in Spark vectorized readers

2021-08-10 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34861:
-
Comment: was deleted

(was: Synced with [~chengsu] offline and I will take over this JIRA.)

> Support nested column in Spark vectorized readers
> -
>
> Key: SPARK-34861
> URL: https://issues.apache.org/jira/browse/SPARK-34861
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is the umbrella task to track the overall progress. The task is to 
> support nested column types in Spark vectorized readers, namely Parquet and 
> ORC. Currently both the Parquet and ORC vectorized readers do not support nested 
> column types (struct, array and map). We implemented a nested column vectorized 
> reader for FB-ORC in our internal fork of Spark. We are seeing performance 
> improvements compared to the non-vectorized reader when reading nested columns. In 
> addition, this can also help improve the non-nested column performance when 
> reading non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36440) Spark3 fails to read hive table with mixed format

2021-08-05 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394529#comment-17394529
 ] 

Chao Sun commented on SPARK-36440:
--

Hmm, really? Spark 2.x supports this? I'm not sure why Spark is still expected to 
work in this case, since the serde is changed to Parquet but the underlying data 
files are still in RCFile. It seems like an error that users should avoid.

> Spark3 fails to read hive table with mixed format
> -
>
> Key: SPARK-36440
> URL: https://issues.apache.org/jira/browse/SPARK-36440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.1.2
>Reporter: Jason Xu
>Priority: Major
>
> Spark 3 fails to read a Hive table with mixed formats using the Hive SerDe; this is a 
> regression compared to Spark 2.4. 
> Reproduction steps:
>  1. In spark 3 (3.0 or 3.1) spark shell:
> {code:java}
> scala> spark.sql("create table tmp.test_table (id int, name string) 
> partitioned by (pt int) stored as rcfile")
> scala> spark.sql("insert into tmp.test_table (pt = 1) values (1, 'Alice'), 
> (2, 'Bob')")
> {code}
> 2. Run hive command to change table file format (from RCFile to Parquet).
> {code:java}
> hive (default)> alter table tmp.test_table set fileformat Parquet;
> {code}
> 3. Try to read partition (in RCFile format) with hive serde using Spark shell:
> {code:java}
> scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
> Exception (file path anonymized):
> {code:java}
> Caused by: java.lang.RuntimeException: 
> s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not 
> a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 
> 96, 1, -33]
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:285)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36317) PruneFileSourcePartitionsSuite tests are failing after the fix to SPARK-36136

2021-07-27 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388204#comment-17388204
 ] 

Chao Sun commented on SPARK-36317:
--

[~vsowrirajan]: the change is already reverted - are you still seeing the test 
failures?

> PruneFileSourcePartitionsSuite tests are failing after the fix to SPARK-36136
> -
>
> Key: SPARK-36317
> URL: https://issues.apache.org/jira/browse/SPARK-36317
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> After the fix to [SPARK-36136][SQL][TESTS] Refactor 
> PruneFileSourcePartitionsSuite etc. to a different package, a couple of tests in 
> PruneFileSourcePartitionsSuite are failing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36137:
-
Description: 
At the moment {{getPartitionsByFilter}} in the Hive shim only falls back to using 
{{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the 
remote HMS. However, in certain cases the remote HMS will fall back to using ORM 
(which only supports string types for partition columns) to query the underlying 
RDBMS even if this config is set to true, and Spark will not be able to recover 
from the error and will simply fail the query. 

For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, and 
Spark was not able to push down the filter for a {{date}} column.


  was:
At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use 
{{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the 
remote HMS. However, in certain cases the remote HMS will fallback to use ORM 
to query the underlying RDBMS even if this config is set to true, and Spark 
will not be able to recover from the error and will just fail the query. 

For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, and 
Spark was not able to pushdown filter for {{date}} column.



> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> At the moment {{getPartitionsByFilter}} in the Hive shim only falls back to using 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to using 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will simply fail the query. 
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down the filter for a {{date}} column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-14 Thread Chao Sun (Jira)
Chao Sun created SPARK-36137:


 Summary: HiveShim always fallback to getAllPartitionsOf regardless 
of whether directSQL is enabled in remote HMS
 Key: SPARK-36137
 URL: https://issues.apache.org/jira/browse/SPARK-36137
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


At the moment {{getPartitionsByFilter}} in the Hive shim only falls back to using 
{{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the 
remote HMS. However, in certain cases the remote HMS will fall back to using ORM 
to query the underlying RDBMS even if this config is set to true, and Spark 
will not be able to recover from the error and will simply fail the query. 

For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, and 
Spark was not able to push down the filter for a {{date}} column.
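
A hedged sketch of the fallback shape being proposed (generic code, not Spark's 
HiveShim or Hive's client API; the real code would catch a MetaException rather 
than a RuntimeException): try the metastore-side filter first, and if it fails, 
list all partitions and prune on the Spark side instead of failing the query.

{code:java}
import java.util.List;
import java.util.function.Supplier;

// Sketch: always have a client-side fallback for metastore partition filtering.
public final class PartitionFetchSketch {
  public static <T> List<T> withFallback(Supplier<List<T>> pushDownFilterToHms,
                                         Supplier<List<T>> listAllThenFilterLocally) {
    try {
      return pushDownFilterToHms.get();       // e.g. getPartitionsByFilter against HMS
    } catch (RuntimeException e) {
      // HMS may fall back to ORM (string-only partition columns) and throw even when
      // hive.metastore.try.direct.sql=true; recover instead of failing the query.
      return listAllThenFilterLocally.get();  // e.g. getAllPartitionsOf + pruning in Spark
    }
  }
}
{code}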




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36136) Move PruneFileSourcePartitionsSuite out of org.apache.spark.sql.hive

2021-07-14 Thread Chao Sun (Jira)
Chao Sun created SPARK-36136:


 Summary: Move PruneFileSourcePartitionsSuite out of 
org.apache.spark.sql.hive
 Key: SPARK-36136
 URL: https://issues.apache.org/jira/browse/SPARK-36136
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently both {{PruneFileSourcePartitionsSuite}} and 
{{PrunePartitionSuiteBase}} are in {{org.apache.spark.sql.hive.execution}} 
which doesn't look right. They should belong to 
{{org.apache.spark.sql.execution.datasources}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning

2021-07-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380326#comment-17380326
 ] 

Chao Sun commented on SPARK-36128:
--

Thanks, I'm slightly inclined to reuse the existing config but document the new 
behavior (e.g., it is used for data source tables too when 
{{spark.sql.hive.manageFilesourcePartitions}} is set). Let me raise a PR for 
this and we can discuss there.

> CatalogFileIndex.filterPartitions should respect 
> spark.sql.hive.metastorePartitionPruning
> -
>
> Key: SPARK-36128
> URL: https://issues.apache.org/jira/browse/SPARK-36128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only 
> used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. 
> The latter calls {{CatalogFileIndex.filterPartitions}} which calls 
> {{listPartitionsByFilter}} regardless of whether the above config is set or 
> not. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36131) Refactor ParquetColumnIndexSuite

2021-07-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-36131:


 Summary: Refactor ParquetColumnIndexSuite
 Key: SPARK-36131
 URL: https://issues.apache.org/jira/browse/SPARK-36131
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This is a minor refactoring on {{ParquetColumnIndexSuite}} to allow better code 
reuse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning

2021-07-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380299#comment-17380299
 ] 

Chao Sun commented on SPARK-36128:
--

[~hyukjin.kwon] you are right - I didn't know this config was designed to be 
used only by Hive table scans, but this poses a few issues:
1. By default, data source tables also manage their partitions through HMS, via the 
config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says 
"When partition management is enabled, datasource tables store partition in the 
Hive metastore, and use the metastore to prune partitions during query 
planning", so it sounds like they should have the same partition pruning 
mechanism as Hive tables.
2. There is effectively only one implementation of {{ExternalCatalog}}, which 
is HMS, so I'm not sure why we treat Hive table scans differently from data 
source table scans, even though both of them read partition metadata 
from HMS.

> CatalogFileIndex.filterPartitions should respect 
> spark.sql.hive.metastorePartitionPruning
> -
>
> Key: SPARK-36128
> URL: https://issues.apache.org/jira/browse/SPARK-36128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only 
> used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. 
> The latter calls {{CatalogFileIndex.filterPartitions}} which calls 
> {{listPartitionsByFilter}} regardless of whether the above config is set or 
> not. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning

2021-07-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380299#comment-17380299
 ] 

Chao Sun edited comment on SPARK-36128 at 7/14/21, 4:24 AM:


[~hyukjin.kwon] you are right - I didn't know this config was designed to be 
used only by Hive table scans, but this poses a few issues:
1. By default, data source tables also manage their partitions through HMS, via the 
config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says 
"When partition management is enabled, datasource tables store partition in the 
Hive metastore, and use the metastore to prune partitions during query 
planning", so it sounds like they should have the same partition pruning 
mechanism as Hive tables, including the flag.
2. There is effectively only one implementation of {{ExternalCatalog}}, which 
is HMS, so I'm not sure why we treat Hive table scans differently from data 
source table scans, even though both of them read partition metadata 
from HMS.


was (Author: csun):
[~hyukjin.kwon] you are right - I didn't know this config is designed to be 
only used by Hive table scan, but this poses a few issues:
1. by default, data source tables also manage their partitions through HMS, via 
config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says 
"When partition management is enabled, datasource tables store partition in the 
Hive metastore, and use the metastore to prune partitions during query 
planning", so it sounds like they should have the same partition pruning 
mechanism as Hive tables.
2. there is effectively only one implementation for {{ExternalCatalog}} which 
is HMS, so I'm not sure why we treat Hive table scans differently than data 
source table scans, even though both of them are reading partition metadata 
from HMS.

> CatalogFileIndex.filterPartitions should respect 
> spark.sql.hive.metastorePartitionPruning
> -
>
> Key: SPARK-36128
> URL: https://issues.apache.org/jira/browse/SPARK-36128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only 
> used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. 
> The latter calls {{CatalogFileIndex.filterPartitions}} which calls 
> {{listPartitionsByFilter}} regardless of whether the above config is set or 
> not. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning

2021-07-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-36128:


 Summary: CatalogFileIndex.filterPartitions should respect 
spark.sql.hive.metastorePartitionPruning
 Key: SPARK-36128
 URL: https://issues.apache.org/jira/browse/SPARK-36128
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only used 
in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. The 
latter calls {{CatalogFileIndex.filterPartitions}} which calls 
{{listPartitionsByFilter}} regardless of whether the above config is set or 
not. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36123) Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36123:
-
Labels: correctness  (was: )

> Parquet vectorized reader doesn't skip null values correctly
> 
>
> Key: SPARK-36123
> URL: https://issues.apache.org/jira/browse/SPARK-36123
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Blocker
>  Labels: correctness
>
> Currently when the Parquet column index is effective, the vectorized reader 
> doesn't skip null values properly, which will cause incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36123) Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-36123:


 Summary: Parquet vectorized reader doesn't skip null values 
correctly
 Key: SPARK-36123
 URL: https://issues.apache.org/jira/browse/SPARK-36123
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently when the Parquet column index is effective, the vectorized reader doesn't 
skip null values properly, which will cause incorrect results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader

2021-07-11 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35743:
-
Labels: parquet  (was: )

> Improve Parquet vectorized reader
> -
>
> Key: SPARK-35743
> URL: https://issues.apache.org/jira/browse/SPARK-35743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>  Labels: parquet
>
> This umbrella JIRA tracks efforts to improve vectorized Parquet reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36056) Combine readBatch and readIntegers in VectorizedRleValuesReader

2021-07-08 Thread Chao Sun (Jira)
Chao Sun created SPARK-36056:


 Summary: Combine readBatch and readIntegers in 
VectorizedRleValuesReader
 Key: SPARK-36056
 URL: https://issues.apache.org/jira/browse/SPARK-36056
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


{{readBatch}} and {{readIntegers}} share a similar code path, and this Jira aims 
to combine them into one method for easier maintenance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions

2021-06-30 Thread Chao Sun (Jira)
Chao Sun created SPARK-35959:


 Summary: Add a new Maven profile "no-shaded-client" for older 
Hadoop 3.x versions 
 Key: SPARK-35959
 URL: https://issues.apache.org/jira/browse/SPARK-35959
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently Spark uses the Hadoop shaded client by default. However, if Spark users 
want to build Spark with an older version of Hadoop, such as 3.1.x, the shaded 
client cannot be used (currently it only supports Hadoop 3.2.2+ and 3.3.1+). 
Therefore, this proposes to offer a new Maven profile "no-shaded-client" for 
this use case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-06-23 Thread Chao Sun (Jira)
Chao Sun created SPARK-35867:


 Summary: Enable vectorized read for 
VectorizedPlainValuesReader.readBooleans
 Key: SPARK-35867
 URL: https://issues.apache.org/jira/browse/SPARK-35867
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently we decode PLAIN-encoded booleans as follows:
{code:java}
  public final void readBooleans(int total, WritableColumnVector c, int rowId) {
// TODO: properly vectorize this
for (int i = 0; i < total; i++) {
  c.putBoolean(rowId + i, readBoolean());
}
  }
{code}

Ideally we should vectorize this.
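
A hedged sketch of what a byte-at-a-time version could look like (standalone code, 
not Spark's WritableColumnVector API; it assumes Parquet's PLAIN boolean layout of 
one bit per value, least-significant bit first):

{code:java}
// Sketch: unpack 'total' PLAIN-encoded booleans starting at byte 'byteOffset' of 'buf'
// into 'out' starting at 'rowId', eight values per input byte.
public class PlainBooleanDecoder {
  static void readBooleans(byte[] buf, int byteOffset, int total, boolean[] out, int rowId) {
    int i = 0;
    for (; i + 8 <= total; i += 8) {             // full bytes: unpack 8 values at once
      int b = buf[byteOffset + (i >> 3)] & 0xFF;
      for (int j = 0; j < 8; j++) {
        out[rowId + i + j] = ((b >> j) & 1) != 0;
      }
    }
    for (; i < total; i++) {                     // tail: last, partially used byte
      out[rowId + i] = ((buf[byteOffset + (i >> 3)] >> (i & 7)) & 1) != 0;
    }
  }
}
{code}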



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35846) Introduce ParquetReadState to track various states while reading a Parquet column chunk

2021-06-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35846:
-
Description: This is mostly refactoring work to complete SPARK-34859

> Introduce ParquetReadState to track various states while reading a Parquet 
> column chunk
> ---
>
> Key: SPARK-35846
> URL: https://issues.apache.org/jira/browse/SPARK-35846
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> This is mostly refactoring work to complete SPARK-34859



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35846) Introduce ParquetReadState to track various states while reading a Parquet column chunk

2021-06-21 Thread Chao Sun (Jira)
Chao Sun created SPARK-35846:


 Summary: Introduce ParquetReadState to track various states while 
reading a Parquet column chunk
 Key: SPARK-35846
 URL: https://issues.apache.org/jira/browse/SPARK-35846
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

2021-06-11 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35640:
-
Parent: SPARK-35743
Issue Type: Sub-task  (was: Improvement)

> Refactor Parquet vectorized reader to remove duplicated code paths
> --
>
> Key: SPARK-35640
> URL: https://issues.apache.org/jira/browse/SPARK-35640
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently in the Parquet vectorized code path, there is a lot of code duplication, 
> such as the following:
> {code:java}
>   public void readIntegers(
>   int total,
>   WritableColumnVector c,
>   int rowId,
>   int level,
>   VectorizedValuesReader data) throws IOException {
> int left = total;
> while (left > 0) {
>   if (this.currentCount == 0) this.readNextGroup();
>   int n = Math.min(left, this.currentCount);
>   switch (mode) {
> case RLE:
>   if (currentValue == level) {
> data.readIntegers(n, c, rowId);
>   } else {
> c.putNulls(rowId, n);
>   }
>   break;
> case PACKED:
>   for (int i = 0; i < n; ++i) {
> if (currentBuffer[currentBufferIdx++] == level) {
>   c.putInt(rowId + i, data.readInteger());
> } else {
>   c.putNull(rowId + i);
> }
>   }
>   break;
>   }
>   rowId += n;
>   left -= n;
>   currentCount -= n;
> }
>   }
> {code}
> This makes it hard to maintain, as any change to this will need to be 
> replicated in 20+ places. The issue becomes more serious when we are going to 
> implement column index and complex type support for the vectorized path.
> The original intention was performance. However, nowadays JIT compilers 
> tend to be smart about this and will inline virtual calls as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35743) Improve Parquet vectorized reader

2021-06-11 Thread Chao Sun (Jira)
Chao Sun created SPARK-35743:


 Summary: Improve Parquet vectorized reader
 Key: SPARK-35743
 URL: https://issues.apache.org/jira/browse/SPARK-35743
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This umbrella JIRA tracks efforts to improve vectorized Parquet reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34861) Support nested column in Spark vectorized readers

2021-06-10 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361090#comment-17361090
 ] 

Chao Sun commented on SPARK-34861:
--

Synced with [~chengsu] offline and I will take over this JIRA.

> Support nested column in Spark vectorized readers
> -
>
> Key: SPARK-34861
> URL: https://issues.apache.org/jira/browse/SPARK-34861
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is the umbrella task to track the overall progress. The task is to 
> support nested column types in Spark vectorized readers, namely Parquet and 
> ORC. Currently both the Parquet and ORC vectorized readers do not support nested 
> column types (struct, array and map). We implemented a nested column vectorized 
> reader for FB-ORC in our internal fork of Spark. We are seeing performance 
> improvements compared to the non-vectorized reader when reading nested columns. In 
> addition, this can also help improve the non-nested column performance when 
> reading non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
>  
> ORC:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35703) Remove HashClusteredDistribution

2021-06-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35703:
-
Description: Currently Spark has {{HashClusteredDistribution}} and 
{{ClusteredDistribution}}. The only difference between the two is that the 
former is more strict when deciding whether a bucket join is allowed to avoid a 
shuffle: compared to the latter, it requires an *exact* match between the 
clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and 
the join keys. However, this is unnecessary, as we should be able to avoid the 
shuffle when the set of clustering keys is a subset of the join keys, just like 
{{ClusteredDistribution}}.   (was: Currently Spark has 
{{HashClusteredDistribution}} and {{ClusteredDistribution}}. The only 
difference between the two is that the former is more strict when deciding 
whether bucket join is allowed to avoid shuffle: comparing to the latter, it 
requires *exact* match between the clustering keys from the output partitioning 
and the join keys. However, this is unnecessary, as we should be able to avoid 
shuffle when the set of clustering keys is a subset of join keys, just like 
{{ClusteredDistribution}}. )

> Remove HashClusteredDistribution
> 
>
> Key: SPARK-35703
> URL: https://issues.apache.org/jira/browse/SPARK-35703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and 
> {{ClusteredDistribution}}. The only difference between the two is that the 
> former is more strict when deciding whether a bucket join is allowed to avoid a 
> shuffle: compared to the latter, it requires an *exact* match between the 
> clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and 
> the join keys. However, this is unnecessary, as we should be able to avoid the 
> shuffle when the set of clustering keys is a subset of the join keys, just like 
> {{ClusteredDistribution}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35703) Remove HashClusteredDistribution

2021-06-09 Thread Chao Sun (Jira)
Chao Sun created SPARK-35703:


 Summary: Remove HashClusteredDistribution
 Key: SPARK-35703
 URL: https://issues.apache.org/jira/browse/SPARK-35703
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently Spark has {{HashClusteredDistribution}} and 
{{ClusteredDistribution}}. The only difference between the two is that the 
former is more strict when deciding whether a bucket join is allowed to avoid a 
shuffle: compared to the latter, it requires an *exact* match between the 
clustering keys from the output partitioning and the join keys. However, this 
is unnecessary, as we should be able to avoid the shuffle when the set of 
clustering keys is a subset of the join keys, just like {{ClusteredDistribution}}. 
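
As a hedged illustration (table and column names are made up, and a SparkSession 
{{spark}} is assumed to be in scope): take two tables bucketed by (a, b) and a join 
on (a, b, c). Under the exact-match rule the bucketing cannot satisfy the required 
distribution and both sides are shuffled; under the subset rule, since {a, b} is a 
subset of the join keys, the existing bucketing could be reused and the shuffle 
avoided.

{code:java}
// Illustrative only: clustering keys (a, b) are a strict subset of the join keys (a, b, c).
spark.sql("CREATE TABLE t1 (a INT, b INT, c INT) USING parquet CLUSTERED BY (a, b) INTO 8 BUCKETS");
spark.sql("CREATE TABLE t2 (a INT, b INT, c INT) USING parquet CLUSTERED BY (a, b) INTO 8 BUCKETS");

// With the exact-match requirement this join still shuffles both sides; with the
// relaxed (ClusteredDistribution-style) rule the shuffle could be avoided.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c").explain();
{code}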



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

2021-06-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-35640:


 Summary: Refactor Parquet vectorized reader to remove duplicated 
code paths
 Key: SPARK-35640
 URL: https://issues.apache.org/jira/browse/SPARK-35640
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently in the Parquet vectorized code path, there is a lot of code duplication, 
such as the following:
{code:java}
  public void readIntegers(
  int total,
  WritableColumnVector c,
  int rowId,
  int level,
  VectorizedValuesReader data) throws IOException {
int left = total;
while (left > 0) {
  if (this.currentCount == 0) this.readNextGroup();
  int n = Math.min(left, this.currentCount);
  switch (mode) {
case RLE:
  if (currentValue == level) {
data.readIntegers(n, c, rowId);
  } else {
c.putNulls(rowId, n);
  }
  break;
case PACKED:
  for (int i = 0; i < n; ++i) {
if (currentBuffer[currentBufferIdx++] == level) {
  c.putInt(rowId + i, data.readInteger());
} else {
  c.putNull(rowId + i);
}
  }
  break;
  }
  rowId += n;
  left -= n;
  currentCount -= n;
}
  }
{code}

This makes it hard to maintain, as any change to this will need to be replicated 
in 20+ places. The issue becomes more serious when we are going to implement 
column index and complex type support for the vectorized path.

The original intention was performance. However, nowadays JIT compilers tend 
to be smart about this and will inline virtual calls as much as possible.
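
One possible shape of the refactoring, as a hedged sketch (interface and method 
names here are illustrative, not necessarily what ends up in Spark; the fields and 
helper types are the same ones used in the snippet above): write the RLE/PACKED 
definition-level loop once and delegate the per-type work to a small callback, 
relying on the JIT to inline the virtual calls.

{code:java}
  interface ValueUpdater {
    // e.g. data.readIntegers(n, c, rowId) for the INT32 case
    void readValues(int n, int rowId, WritableColumnVector c, VectorizedValuesReader data);
    // e.g. c.putInt(rowId, data.readInteger()) for the INT32 case
    void readValue(int rowId, WritableColumnVector c, VectorizedValuesReader data);
  }

  public void readBatch(
      int total,
      WritableColumnVector c,
      int rowId,
      int level,
      VectorizedValuesReader data,
      ValueUpdater updater) throws IOException {
    int left = total;
    while (left > 0) {
      if (this.currentCount == 0) this.readNextGroup();
      int n = Math.min(left, this.currentCount);
      switch (mode) {
        case RLE:
          if (currentValue == level) {
            updater.readValues(n, rowId, c, data);
          } else {
            c.putNulls(rowId, n);
          }
          break;
        case PACKED:
          for (int i = 0; i < n; ++i) {
            if (currentBuffer[currentBufferIdx++] == level) {
              updater.readValue(rowId + i, c, data);
            } else {
              c.putNull(rowId + i);
            }
          }
          break;
      }
      rowId += n;
      left -= n;
      currentCount -= n;
    }
  }
{code}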



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34859:
-
Priority: Critical  (was: Major)

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Critical
>  Labels: correctness
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned, and some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` doesn't have such synchronization 
> among pages from different columns, so using `readNextFilteredRowGroup` may 
> produce incorrect results.
>  
> I have attached an example parquet file. The file was generated with 
> `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this 
> file is listed below.
> row group 0
> 
> _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
> [more]...
> _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 
> ENC:PLAIN,BIT_PACKED [more]...
>     _1 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     _2 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>  
> As you can see in the row group 0, column1 has 4 data pages each with 500 
> values and column 2 has 2 data pages with 1000 values each. 
> If we want to filter the rows by values with _1 = 510 using columnindex, 
> parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 
> of column _1 starts with row 500, and page 0 of column _2 starts with row 0, 
> and it will be incorrect if we simply read the two values as one row.
>  
> As an example, if you try to filter with _1 = 510 with column index on in the 
> current version, it will give you the wrong result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|10 |
> +---+---+
> And if you turn column index off, you get the correct result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|510|
> +---+---+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-05-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34859:
-
Labels: correctness  (was: )

> Vectorized parquet reader needs synchronization among pages for column index
> 
>
> Key: SPARK-34859
> URL: https://issues.apache.org/jira/browse/SPARK-34859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Li Xian
>Priority: Major
>  Labels: correctness
> Attachments: 
> part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
>
>
> The current implementation has a problem: the pages returned by 
> `readNextFilteredRowGroup` may not be aligned; some columns may have more 
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
> with `rowIndexes` to make sure that rows are aligned. 
> Currently `VectorizedParquetRecordReader` doesn't have such synchronization 
> among pages from different columns. Using `readNextFilteredRowGroup` may 
> therefore produce incorrect results.
>  
> I have attached an example parquet file. This file is generated with 
> `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this 
> file is listed below.
> row group 0
> 
> _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED 
> [more]...
> _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 
> ENC:PLAIN,BIT_PACKED [more]...
>     _1 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:500
>     _2 TV=2000 RL=0 DL=0
>     
> 
>     page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>     page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  
> [more]... VC:1000
>  
> As you can see in the row group 0, column1 has 4 data pages each with 500 
> values and column 2 has 2 data pages with 1000 values each. 
> If we want to filter the rows by values with _1 = 510 using columnindex, 
> parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 
> of column _1 starts with row 500, and page 0 of column _2 starts with row 0, 
> and it will be incorrect if we simply read the two values as one row.
>  
> As an example, if you try to filter with _1 = 510 with column index on in the 
> current version, it will give you the wrong result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|10 |
> +---+---+
> And if you turn column index off, you get the correct result:
> +---+---+
> |_1 |_2 |
> +---+---+
> |510|510|
> +---+---+
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35461) Error when reading dictionary-encoded Parquet int column when read schema is bigint

2021-05-20 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348667#comment-17348667
 ] 

Chao Sun commented on SPARK-35461:
--

Actually this also fails when turning off the vectorized reader:
{code}
Caused by: java.lang.ClassCastException: class 
org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to class 
org.apache.spark.sql.catalyst.expressions.MutableInt 
(org.apache.spark.sql.catalyst.expressions.MutableLong and 
org.apache.spark.sql.catalyst.expressions.MutableInt are in unnamed module of 
loader 'app')
at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setInt(SpecificInternalRow.scala:253)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:178)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:88)
at 
org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:297)
at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229)
{code}
In this case parquet-mr is able to return the value but Spark won't be able to 
handle it.
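
For reference, a minimal, self-contained sketch of the kind of widening both 
code paths would need when the file stores INT32 but the read schema asks for 
bigint; the class and method names here are illustrative, not Spark's or 
Parquet's actual ones:
{code}
// Illustrative only: decode through the int dictionary and upcast, rather than
// asking an integer dictionary to decodeToLong (which it does not support).
object WideningDictionarySketch extends App {
  trait IntDictionary {
    def decodeToInt(id: Int): Int
  }

  final class WideningDictionary(underlying: IntDictionary) {
    def decodeToLong(id: Int): Long = underlying.decodeToInt(id).toLong
  }

  val dict = new IntDictionary { def decodeToInt(id: Int): Int = Array(7, 42, 99)(id) }
  val widened = new WideningDictionary(dict)
  assert(widened.decodeToLong(1) == 42L)
}
{code}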

> Error when reading dictionary-encoded Parquet int column when read schema is 
> bigint
> ---
>
> Key: SPARK-35461
> URL: https://issues.apache.org/jira/browse/SPARK-35461
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Chao Sun
>Priority: Major
>
> When reading a dictionary-encoded integer column from a Parquet file, and 
> users specify read schema to be bigint, Spark currently will fail with the 
> following exception:
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
> {code}
> To reproduce:
> {code}
> val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, 
> i.toString))
> withParquetFile(data) { path =>
>   val readSchema = StructType(Seq(StructField("_1", LongType)))
>   spark.read.schema(readSchema).parquet(path).first()
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35461) Error when reading dictionary-encoded Parquet int column when read schema is bigint

2021-05-20 Thread Chao Sun (Jira)
Chao Sun created SPARK-35461:


 Summary: Error when reading dictionary-encoded Parquet int column 
when read schema is bigint
 Key: SPARK-35461
 URL: https://issues.apache.org/jira/browse/SPARK-35461
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Chao Sun


When reading a dictionary-encoded integer column from a Parquet file with the 
read schema specified as bigint, Spark currently fails with the following 
exception:
{code}
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
{code}

To reproduce:
{code}
val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
withParquetFile(data) { path =>
  val readSchema = StructType(Seq(StructField("_1", LongType)))
  spark.read.schema(readSchema).parquet(path).first()
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35422) Many test cases failed in Scala 2.13 CI

2021-05-17 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346448#comment-17346448
 ] 

Chao Sun commented on SPARK-35422:
--

Thanks [~dongjoon]. I've opened a PR for the above failures: 
https://github.com/apache/spark/pull/32575

> Many test cases failed in Scala 2.13 CI
> ---
>
> Key: SPARK-35422
> URL: https://issues.apache.org/jira/browse/SPARK-35422
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/]
>  
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   
> [org.apache.spark.sql.SQLQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/]|2.4
>  
> sec|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified 
> (tpcds-modifiedQueries/q46)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q46_/]|59
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified 
> (tpcds-modifiedQueries/q53)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q53_/]|62
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified 
> (tpcds-modifiedQueries/q63)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q63_/]|54
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified 
> (tpcds-modifiedQueries/q68)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q68_/]|50
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified 
> (tpcds-modifiedQueries/q73)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q73_/]|58
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check 
> simplified sf100 
> (tpcds-modifiedQueries/q46)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilityWithStatsSuite/check_simplified_sf100__tpcds_modifiedQueries_q46_/]|62
>  
> ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]|
> |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png!
>   [org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check 
> simplified sf100 
> (tpcds-modifiedQueries/q53)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilityWithStatsSuite/check_simplified_sf100__tpcds_modified

[jira] [Created] (SPARK-35390) Handle type coercion when resolving V2 functions

2021-05-12 Thread Chao Sun (Jira)
Chao Sun created SPARK-35390:


 Summary: Handle type coercion when resolving V2 functions 
 Key: SPARK-35390
 URL: https://issues.apache.org/jira/browse/SPARK-35390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When resolving V2 functions, we should handle type coercion by checking the 
expected argument types declared by the bound UDF.
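
A hedged sketch of what the coercion step could look like, assuming the 
{{BoundFunction#inputTypes}} API from SPARK-27658; the real rule would go 
through the implicit-cast machinery rather than a blind {{Cast}}:
{code}
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.connector.catalog.functions.BoundFunction

object V2FunctionCoercionSketch {
  // Sketch only: cast each argument whose type differs from the type the bound
  // function declares, so the UDF sees exactly the types it asked for.
  def coerceArguments(bound: BoundFunction, args: Seq[Expression]): Seq[Expression] =
    args.zip(bound.inputTypes()).map {
      case (arg, expected) if arg.dataType != expected => Cast(arg, expected)
      case (arg, _) => arg
    }
}
{code}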



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35389) Analyzer should set progagateNull to false for magic function invocation

2021-05-12 Thread Chao Sun (Jira)
Chao Sun created SPARK-35389:


 Summary: Analyzer should set progagateNull to false for magic 
function invocation
 Key: SPARK-35389
 URL: https://issues.apache.org/jira/browse/SPARK-35389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


For both {{Invoke}} and {{StaticInvoke}} used by the magic method of 
{{ScalarFunction}}, we should set {{propagateNull}} to false, so that null 
values are passed to the UDF for evaluation instead of bypassing it and 
directly returning null. 
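
A tiny, self-contained illustration of the difference (this is not the actual 
{{Invoke}} implementation): with {{propagateNull}} set to true, a single null 
argument short-circuits the call to null, while with false the null is handed 
to the function so the UDF decides what it means:
{code}
object PropagateNullSketch extends App {
  // Sketch only: mimics how propagateNull changes the evaluation of an invocation.
  def invoke(args: Seq[Any], propagateNull: Boolean)(f: Seq[Any] => Any): Any =
    if (propagateNull && args.contains(null)) null else f(args)

  val isNullUdf: Seq[Any] => Any = args => args.head == null

  assert(invoke(Seq(null), propagateNull = true)(isNullUdf) == null)  // the UDF never runs
  assert(invoke(Seq(null), propagateNull = false)(isNullUdf) == true) // the UDF sees the null
}
{code}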



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-11 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35384:
-
Issue Type: Improvement  (was: Bug)

> Improve performance for InvokeLike.invoke
> -
>
> Key: SPARK-35384
> URL: https://issues.apache.org/jira/browse/SPARK-35384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> `InvokeLike.invoke` uses `map` to evaluate arguments:
> {code:java}
> val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
> if (needNullCheck && args.exists(_ == null)) {
>   // return null if one of arguments is null
>   null
> } else { 
> {code}
> which seems pretty expensive if the method itself is trivial. We can change 
> it to a plain for-loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-11 Thread Chao Sun (Jira)
Chao Sun created SPARK-35384:


 Summary: Improve performance for InvokeLike.invoke
 Key: SPARK-35384
 URL: https://issues.apache.org/jira/browse/SPARK-35384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


`InvokeLike.invoke` uses `map` to evaluate arguments:
{code:java}
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
  // return null if one of arguments is null
  null
} else { 
{code}
which seems pretty expensive if the method itself is trivial. We can change it 
to a plain for-loop.
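
A hedged sketch of the suggested rewrite, reusing the names from the snippet 
above ({{arguments}}, {{input}}, {{needNullCheck}}); the point is a single pass 
over a preallocated array, with the null check folded into the same loop:
{code}
// Sketch only: no intermediate collection from map() and no extra exists() scan.
val args = new Array[Object](arguments.length)
var i = 0
var foundNull = false
while (i < args.length) {
  args(i) = arguments(i).eval(input).asInstanceOf[Object]
  if (needNullCheck && args(i) == null) foundNull = true
  i += 1
}
if (foundNull) {
  // return null if one of the arguments is null, as before
  null
} else {
  // ... invoke the method with args, as in the original code
}
{code}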



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35361:
-
Priority: Minor  (was: Major)

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost with `zipWithIndex` call. This proposes to 
> move the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35361:
-
Priority: Major  (was: Minor)

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost with `zipWithIndex` call. This proposes to 
> move the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)
Chao Sun created SPARK-35361:


 Summary: Improve performance for ApplyFunctionExpression
 Key: SPARK-35361
 URL: https://issues.apache.org/jira/browse/SPARK-35361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` can 
incur significant runtime cost from the `zipWithIndex` call. This proposes to 
move the call outside the per-row loop.
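
A small, self-contained illustration of the change (not the actual 
`ApplyFunctionExpression` code): zip the children with their indices once, 
instead of on every input row:
{code}
object HoistZipWithIndexSketch {
  // Stand-ins for child expressions: each "evaluates" an input row to a value.
  val children: Seq[String => Any] = Seq(_.length, _.toUpperCase)
  private val reusedRow = new Array[Any](children.size)

  // Before: allocates a fresh zipped sequence for every input row.
  def evalPerRow(row: String): Array[Any] = {
    children.zipWithIndex.foreach { case (child, i) => reusedRow(i) = child(row) }
    reusedRow
  }

  // After: the (child, index) pairs are computed once and reused for all rows.
  private lazy val indexedChildren = children.zipWithIndex
  def evalHoisted(row: String): Array[Any] = {
    indexedChildren.foreach { case (child, i) => reusedRow(i) = child(row) }
    reusedRow
  }
}
{code}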



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35321:
-
Issue Type: Bug  (was: Improvement)

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads the Hive permanent function directly from HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339946#comment-17339946
 ] 

Chao Sun commented on SPARK-35321:
--

[~yumwang] I'm thinking of using 
[Hive#getWithFastCheck|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L389]
 for this purpose, which allows us to set the flag to false. The fast check 
flag also offers a way to compare {{HiveConf}} faster when the conf rarely 
changes.
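
A hedged sketch of that idea, assuming the two-argument {{getWithFastCheck}} 
overload linked above is available in the Hive version Spark builds against:
{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

object HiveClientSketch {
  // Sketch only: obtain the Hive client without eagerly reloading permanent
  // functions, so an old HMS that lacks get_all_functions is not hit at start-up.
  def getWithoutFunctionReload(conf: HiveConf): Hive =
    Hive.getWithFastCheck(conf, /* doRegisterAllFns = */ false)
}
{code}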

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads the Hive permanent function directly from HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339906#comment-17339906
 ] 

Chao Sun commented on SPARK-35321:
--

[~xkrogen] yes that can help to solve the issue, but users need to specify both 
{{spark.sql.hive.metastore.version}} and {{spark.sql.hive.metastore.jars}}. The 
latter is not so easy to set up: the {{maven}} option usually takes a very long 
time to download all the jars, while the {{path}} option requires users to 
download all the relevant Hive jars for the specific version, which is tedious. 

I think this specific issue is worth fixing in Spark itself regardless, since 
from what I can see Spark doesn't really need to load all the permanent 
functions when starting up the Hive client. The process could also be pretty 
expensive if there are many UDFs registered in HMS.

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads the Hive permanent function directly from HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Chao Sun (Jira)
Chao Sun created SPARK-35321:


 Summary: Spark 3.x can't talk to HMS 1.2.x and lower due to 
get_all_functions Thrift API missing
 Key: SPARK-35321
 URL: https://issues.apache.org/jira/browse/SPARK-35321
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Chao Sun


https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
{{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This 
is called when creating a new {{Hive}} object:
{code}
  private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
conf = c;
if (doRegisterAllFns) {
  registerAllFunctionsOnce();
}
  }
{code}

{{registerAllFunctionsOnce}} will reload all the permanent functions by 
calling the {{get_all_functions}} API from the metastore. In Spark, we always 
pass {{doRegisterAllFns}} as true, and this will cause failure:
{code}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
{code}

It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
since it loads the Hive permanent function directly from HMS API. The Hive 
{{FunctionRegistry}} is only used for loading Hive built-in functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35321:
-
Description: 
https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
{{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This 
is called when creating a new {{Hive}} object:
{code}
  private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
conf = c;
if (doRegisterAllFns) {
  registerAllFunctionsOnce();
}
  }
{code}

{{registerAllFunctionsOnce}} will reload all the permanent functions by calling 
the {{get_all_functions}} API from the metastore. In Spark, we always pass 
{{doRegisterAllFns}} as true, and this will cause failure:
{code}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
{code}

It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
since it loads the Hive permanent function directly from HMS API. The Hive 
{{FunctionRegistry}} is only used for loading Hive built-in functions.

  was:
https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
{{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This 
is called when creating a new {{Hive}} object:
{code}
  private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
conf = c;
if (doRegisterAllFns) {
  registerAllFunctionsOnce();
}
  }
{code}

{{registerAllFunctionsOnce }} will reload all the permanent functions by 
calling the {{get_all_functions}} API from the metastore. In Spark, we always 
pass {{doRegisterAllFns}} as true, and this will cause failure:
{code}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
'get_all_functions'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
{code}

It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
since it loads the Hive permanent function directly from HMS API. The Hive 
{{FunctionRegistry}} is only used for loading Hive built-in functions.


> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.relo

[jira] [Updated] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT

2021-05-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35315:
-
Priority: Minor  (was: Major)

> Keep benchmark result consistent between spark-submit and SBT
> -
>
> Key: SPARK-35315
> URL: https://issues.apache.org/jira/browse/SPARK-35315
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently benchmark can be done in two ways: {{spark-submit}}, or SBT 
> command. However in the former Spark will miss some properties such as 
> {{IS_TESTING}}, which is useful to turn on/off some behavior like codegen. 
> Therefore, the result could differ with the two methods. In addition, the 
> benchmark GitHub workflow is using the {{spark-submit}} approach.
> This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is 
> always on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT

2021-05-04 Thread Chao Sun (Jira)
Chao Sun created SPARK-35315:


 Summary: Keep benchmark result consistent between spark-submit and 
SBT
 Key: SPARK-35315
 URL: https://issues.apache.org/jira/browse/SPARK-35315
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently benchmarks can be run in two ways: via {{spark-submit}} or via an SBT 
command. However, in the former Spark will miss some properties such as 
{{IS_TESTING}}, which is useful for turning some behavior, like codegen, on or 
off. Therefore, the results can differ between the two methods. In addition, 
the benchmark GitHub workflow uses the {{spark-submit}} approach.

This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is 
always on.
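
A minimal sketch of the proposal, assuming Spark's internal {{Tests.IS_TESTING}} 
entry (backed by the {{spark.testing}} system property); the flag is forced on 
in the shared base class before any benchmark code runs:
{code}
import org.apache.spark.internal.config.Tests.IS_TESTING

// Sketch only: both the spark-submit and the SBT entry points go through main(),
// so setting the property here makes the two runs see the same flag.
abstract class BenchmarkBaseSketch {
  def runBenchmarkSuite(mainArgs: Array[String]): Unit

  final def main(args: Array[String]): Unit = {
    System.setProperty(IS_TESTING.key, "true")
    runBenchmarkSuite(args)
  }
}
{code}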



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35281) StaticInvoke should not apply boxing if return type is primitive

2021-04-30 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35281:
-
Priority: Minor  (was: Major)

> StaticInvoke should not apply boxing if return type is primitive
> 
>
> Key: SPARK-35281
> URL: https://issues.apache.org/jira/browse/SPARK-35281
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently {{StaticInvoke}} will apply boxing even if the return type is 
> primitive. This seems unnecessary. In comparison, {{Invoke}} checks this and 
> skips the boxing if return type is primitive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35281) StaticInvoke should not apply boxing if return type is primitive

2021-04-30 Thread Chao Sun (Jira)
Chao Sun created SPARK-35281:


 Summary: StaticInvoke should not apply boxing if return type is 
primitive
 Key: SPARK-35281
 URL: https://issues.apache.org/jira/browse/SPARK-35281
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently {{StaticInvoke}} will apply boxing even if the return type is 
primitive. This seems unnecessary. In comparison, {{Invoke}} checks this and 
skips the boxing if return type is primitive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35261) Support static invoke for stateless UDF

2021-04-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-35261:


 Summary: Support static invoke for stateless UDF
 Key: SPARK-35261
 URL: https://issues.apache.org/jira/browse/SPARK-35261
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


For UDFs that are stateless, we should allow users to define the "magic method" as 
a static Java method, which removes the extra cost of dynamic dispatch and gives 
better performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34981:
-
Parent: SPARK-35260
Issue Type: Sub-task  (was: Improvement)

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35260) DataSourceV2 Function Catalog implementation

2021-04-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-35260:


 Summary: DataSourceV2 Function Catalog implementation
 Key: SPARK-35260
 URL: https://issues.apache.org/jira/browse/SPARK-35260
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This tracks the implementation and follow-up work for V2 Function Catalog 
introduced in SPARK-27658



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35233) Switch from bintray to scala.jfrog.io for SBT download in branch 2.4 and 3.0

2021-04-26 Thread Chao Sun (Jira)
Chao Sun created SPARK-35233:


 Summary: Switch from bintray to scala.jfrog.io for SBT download in 
branch 2.4 and 3.0
 Key: SPARK-35233
 URL: https://issues.apache.org/jira/browse/SPARK-35233
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 2.4.8, 3.1.2
Reporter: Chao Sun


As bintray is going to be 
[deprecated|https://eed3si9n.com/bintray-to-jfrog-artifactory-migration-status-and-sbt-1.5.1],
 the download of {{sbt-launch.jar}} will fail. This proposes to migrate the URL 
from {{https://dl.bintray.com/typesafe}} to 
{{https://scala.jfrog.io/artifactory}} for branch-2.4 and branch-3.0 which are 
affected by this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35232) Nested column pruning should retain column metadata

2021-04-26 Thread Chao Sun (Jira)
Chao Sun created SPARK-35232:


 Summary: Nested column pruning should retain column metadata
 Key: SPARK-35232
 URL: https://issues.apache.org/jira/browse/SPARK-35232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


It seems we should retain column metadata when pruning nested columns. 
Otherwise the info is lost, which affects things such as reconstructing the 
CHAR/VARCHAR type (SPARK-33901).
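
A hedged sketch of the fix direction (not the exact pruning rule): rebuild the 
pruned field by copying the original {{StructField}}, so its metadata is carried 
along with the new data type:
{code}
import org.apache.spark.sql.types.{DataType, StructField}

object PruneWithMetadataSketch {
  // Sketch only: name, nullability and metadata come from the original field;
  // only the (pruned) data type changes.
  def prunedField(original: StructField, prunedType: DataType): StructField =
    original.copy(dataType = prunedType)
}
{code}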



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread Chao Sun (Jira)
Chao Sun created SPARK-35195:


 Summary: Move InMemoryTable etc to 
org.apache.spark.sql.connector.catalog
 Key: SPARK-35195
 URL: https://issues.apache.org/jira/browse/SPARK-35195
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently test classes such as {{InMemoryTable}} reside in 
{{org.apache.spark.sql.connector}} rather than 
{{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35003) Improve performance for reading smallint in vectorized Parquet reader

2021-04-09 Thread Chao Sun (Jira)
Chao Sun created SPARK-35003:


 Summary: Improve performance for reading smallint in vectorized 
Parquet reader
 Key: SPARK-35003
 URL: https://issues.apache.org/jira/browse/SPARK-35003
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently {{VectorizedRleValuesReader}} reads shorts in the following way:
{code:java}
for (int i = 0; i < n; i++) {
  c.putShort(rowId + i, (short)data.readInteger());
}
{code}

For PLAIN encoding {{readInteger}} is done via:
{code:java}
public final int readInteger() {
  return getBuffer(4).getInt();
}
{code}
which means it needs to repeatedly {{slice}} the buffer, which is more 
expensive than slicing once for a big chunk and then reading the ints out.
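
A self-contained sketch of the idea (not the actual reader code): take one view 
covering the whole run and decode all the 4-byte ints from it, instead of 
slicing per value:
{code}
import java.nio.{ByteBuffer, ByteOrder}

object ChunkedShortReadSketch {
  // Sketch only: one view over the n * 4 bytes of the run, then a tight decode loop.
  def readShorts(n: Int, in: ByteBuffer, out: Array[Short], rowId: Int): Unit = {
    val chunk = in.duplicate().order(ByteOrder.LITTLE_ENDIAN) // Parquet PLAIN data is little-endian
    var i = 0
    while (i < n) {
      out(rowId + i) = chunk.getInt().toShort
      i += 1
    }
    in.position(in.position() + n * 4) // advance the source past the consumed run
  }
}
{code}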
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-04-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316714#comment-17316714
 ] 

Chao Sun commented on SPARK-34780:
--

Hi [~mikechen] (and sorry for the late reply again), thanks for providing 
another very useful code snippet! I'm not sure if this qualifies as a 
correctness issue though, since it is (to me) more like different 
interpretations of malformed columns in CSV.

My previous statement about {{SessionState}} is incorrect: it seems the conf in 
{{SessionState}} is always the most up-to-date one. The only solution I can 
think of for this issue is to take the conf into account when checking 
equality of {{HadoopFsRelation}} (and potentially others), which means we'd 
need to define equality for {{SQLConf}}.
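
A heavily hedged, self-contained sketch of that direction, with made-up names: 
fold the conf values that influence how the relation is read into its equality, 
so a plan fragment built under different settings no longer matches the cache 
entry:
{code}
object ConfAwareEqualitySketch extends App {
  // Sketch only: a stand-in relation plus the subset of conf entries that affect it.
  final case class RelevantConf(fileCompressionFactor: Double)
  final case class Relation(path: String, columns: Seq[String], conf: RelevantConf)

  // Because the conf is part of the case class, relations read under different
  // compression factors are no longer equal, so the stale cache entry is not reused.
  val a = Relation("/table", Seq("key"), RelevantConf(1.0))
  val b = Relation("/table", Seq("key"), RelevantConf(10.0))
  assert(a != b)
}
{code}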

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of it's logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316477#comment-17316477
 ] 

Chao Sun commented on SPARK-34981:
--

Will submit a PR soon.

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Chao Sun (Jira)
Chao Sun created SPARK-34981:


 Summary: Implement V2 function resolution and evaluation 
 Key: SPARK-34981
 URL: https://issues.apache.org/jira/browse/SPARK-34981
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims 
at implementing function resolution (in the analyzer) and evaluation by wrapping 
the resolved functions into corresponding expressions.
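
A hedged sketch of the resolution half, assuming the {{FunctionCatalog}} / 
{{UnboundFunction}} API from SPARK-27658; the expression wrapping (scalar vs. 
aggregate, magic-method fast path) is left out:
{code}
import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier}
import org.apache.spark.sql.connector.catalog.functions.BoundFunction
import org.apache.spark.sql.types.StructType

object V2FunctionResolutionSketch {
  // Sketch only: look the function up in the catalog and bind it to the argument
  // schema; the analyzer would then wrap the BoundFunction in the matching expression.
  def resolve(catalog: FunctionCatalog, ident: Identifier, argSchema: StructType): BoundFunction =
    catalog.loadFunction(ident).bind(argSchema)
}
{code}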



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34973) Cleanup unused fields and methods in vectorized Parquet reader

2021-04-06 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34973:
-
Priority: Minor  (was: Major)

> Cleanup unused fields and methods in vectorized Parquet reader
> --
>
> Key: SPARK-34973
> URL: https://issues.apache.org/jira/browse/SPARK-34973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> There are some legacy fields and methods in vectorized Parquet reader which 
> are no longer used. It's better to clean them up to make the code easier to 
> maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34973) Cleanup unused fields and methods in vectorized Parquet reader

2021-04-06 Thread Chao Sun (Jira)
Chao Sun created SPARK-34973:


 Summary: Cleanup unused fields and methods in vectorized Parquet 
reader
 Key: SPARK-34973
 URL: https://issues.apache.org/jira/browse/SPARK-34973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


There are some legacy fields and methods in vectorized Parquet reader which are 
no longer used. It's better to clean them up to make the code easier to 
maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34947) Streaming write to a V2 table should invalidate its associated cache

2021-04-02 Thread Chao Sun (Jira)
Chao Sun created SPARK-34947:


 Summary: Streaming write to a V2 table should invalidate its 
associated cache
 Key: SPARK-34947
 URL: https://issues.apache.org/jira/browse/SPARK-34947
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.1
Reporter: Chao Sun


A DSv2 table that supports {{STREAMING_WRITE}} can be written to by a streaming 
job. However, Spark currently doesn't invalidate its associated cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34945) Fix Javadoc for catalyst module

2021-04-02 Thread Chao Sun (Jira)
Chao Sun created SPARK-34945:


 Summary: Fix Javadoc for catalyst module
 Key: SPARK-34945
 URL: https://issues.apache.org/jira/browse/SPARK-34945
 Project: Spark
  Issue Type: Improvement
  Components: docs, Documentation
Affects Versions: 3.2.0
Reporter: Chao Sun


Inside the catalyst module there are many Java classes, especially those in 
DSv2, that do not use the proper Javadoc format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-25 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308854#comment-17308854
 ] 

Chao Sun commented on SPARK-34780:
--

[~mikechen], yes you're right. I'm not sure if this is a big concern though, 
since it just means the plan fragment for the cache is executed with the stale 
conf. I guess as long as there is no correctness issue (and I'd be surprised if 
there were any), it should be fine?

It seems a bit tricky to fix the issue, since the {{SparkSession}} is leaked to 
many places. I guess one way is to follow the idea of SPARK-33389 and change 
{{SessionState}} to always use the active conf. 

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of it's logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308262#comment-17308262
 ] 

Chao Sun commented on SPARK-34780:
--

Sorry for the late reply [~mikechen]! There's something I'm still not quite 
clear about: when the cache is retrieved, an {{InMemoryRelation}} will be used to 
replace the matched plan fragment. So how can the old stale conf still be used in 
places like {{DataSourceScanExec}}?

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30497) migrate DESCRIBE TABLE to the new framework

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308067#comment-17308067
 ] 

Chao Sun commented on SPARK-30497:
--

[~cloud_fan] this is resolved now, right?

> migrate DESCRIBE TABLE to the new framework
> ---
>
> Key: SPARK-30497
> URL: https://issues.apache.org/jira/browse/SPARK-30497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305109#comment-17305109
 ] 

Chao Sun commented on SPARK-34780:
--

Thanks for the report [~mikechen], the test case you provided is very 
useful. 

I'm not sure, though, how severe the issue is, since it only affects 
{{computeStats}}, and once the cache is actually materialized (e.g., via 
{{df2.count()}} after {{df2.cache()}}), the value from {{computeStats}} will be 
different anyway. Could you give more details?
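
To make that concrete, a rough spark-shell sketch (the path and column name are 
illustrative):
{code:java}
// Rough illustration (path/column made up): the estimate differs before and
// after the cache is materialized.
val df2 = spark.read.parquet("/tmp/table").select("key")
df2.cache()
println(df2.queryExecution.optimizedPlan.stats)   // estimate from the captured conf

df2.count()                                       // materializes the cache

// Re-derive the plan so the statistics are recomputed from the cached batches:
val df3 = spark.read.parquet("/tmp/table").select("key")
println(df3.queryExecution.optimizedPlan.stats)
{code}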

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Description: Currently in {{SpecificParquetRecordReaderBase}} we use 
deprecated Parquet APIs in a few places, such as {{readFooter}}, 
{{ParquetInputSplit}}, a deprecated ctor for {{ParquetFileReader}}, 
{{filterRowGroups}}, etc. These are going to be removed in future Parquet 
versions, so we should move to the new APIs for better maintainability.  
 (was: Parquet vectorized reader still uses the old API for {{filterRowGroups}} 
and only filters on statistics. It should switch to the new API and do 
dictionary filtering as well.)
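
For reference, a sketch of what the non-deprecated footer-reading path looks like 
(illustrative only, not the actual Spark change; the file path is made up):
{code:java}
// Sketch only: reading the footer via ParquetFileReader.open + ParquetReadOptions
// instead of the deprecated readFooter / ParquetFileReader ctor.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf)
val options = HadoopReadOptions.builder(conf).build()
val reader = ParquetFileReader.open(inputFile, options)
try {
  val footer = reader.getFooter          // replaces the deprecated readFooter
  val rowGroups = reader.getRowGroups    // row groups already filtered per the options
  println(s"row groups: ${rowGroups.size()}, schema: ${footer.getFileMetaData.getSchema}")
} finally {
  reader.close()
}
{code}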

> Replace deprecated API calls from SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently in {{SpecificParquetRecordReaderBase}} we use deprecated Parquet APIs in 
> a few places, such as {{readFooter}}, {{ParquetInputSplit}}, a deprecated ctor for 
> {{ParquetFileReader}}, {{filterRowGroups}}, etc. These are going to be removed in 
> future Parquet versions, so we should move to the new APIs for better 
> maintainability. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Summary: Replace deprecated API calls from SpecificParquetRecordReaderBase  
(was: Enable dictionary filtering for Parquet vectorized reader)

> Replace deprecated API calls from SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290707#comment-17290707
 ] 

Chao Sun commented on SPARK-33212:
--

Yes. I think the only class Spark needs from this jar is 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together 
with the other two classes it depends on from the same package, has no Guava 
dependency except for {{VisibleForTesting}}.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613
 ] 

Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM:


I was able to reproduce the error in my local environment, and found a potential 
fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark 
- all the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.


was (Author: csun):
I was able to reproduce the error in my local environment, and find a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613
 ] 

Chao Sun commented on SPARK-33212:
--

I was able to reproduce the error in my local environment, and found a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290127#comment-17290127
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks again [~ouyangxc.zte]. 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included 
in the {{hadoop-client}} jars since it is a server-side class and ideally 
should not be exposed to client applications such as Spark. 

[~dongjoon] Let me see how we can fix this either in Spark or Hadoop.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the details [~ouyangxc.zte]!

{quote}
Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath
{quote}
This is interesting - the {{hadoop-client-minicluster.jar}} should only be used 
in tests, so I'm curious why it is needed here. Could you share stack traces for 
the {{ClassNotFoundException}}?

{quote}
2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter
{quote}
Could you also share the stacktraces for this exception?

And to confirm, you are using {{client}} as the deploy mode, is that correct? 
I'll try to reproduce this in my local environment.


> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as 
error messages, stack traces, and steps to reproduce the issue?

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark relies on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34419) Move PartitionTransforms from java to scala directory

2021-02-10 Thread Chao Sun (Jira)
Chao Sun created SPARK-34419:


 Summary: Move PartitionTransforms from java to scala directory
 Key: SPARK-34419
 URL: https://issues.apache.org/jira/browse/SPARK-34419
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


{{PartitionTransforms}} is currently under 
{{sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions}}. It 
should be under 
{{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views

2021-02-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-34347:


 Summary: CatalogImpl.uncacheTable should invalidate in cascade for 
temp views 
 Key: SPARK-34347
 URL: https://issues.apache.org/jira/browse/SPARK-34347
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, 
{{CatalogImpl.uncacheTable}} should invalidate caches for temp views in cascade.
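
For illustration, a rough spark-shell sketch of the expected behavior (the table 
and view names are made up):
{code:java}
// Sketch only: table/view names are illustrative.
spark.range(10).write.saveAsTable("t")
spark.sql("CREATE TEMPORARY VIEW v1 AS SELECT id FROM t")
spark.sql("CREATE TEMPORARY VIEW v2 AS SELECT id FROM v1 WHERE id > 5")
spark.sql("CACHE TABLE v1")
spark.sql("CACHE TABLE v2")

spark.catalog.uncacheTable("v1")
// With cascading invalidation, the cache built on top of v1 (i.e. v2's cache)
// should be invalidated as well, not just v1's own cache entry.
{code}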



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-34108.
--
Resolution: Duplicate

> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a temporary or permanent view doesn't work in certain 
> cases. For instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra project operator which makes the comparison on canonicalized 
> plan during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Description: 
Currently, caching a temporary or permanent view doesn't work in certain cases. 
For instance, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}
The last SELECT query will hit the cached {{v1}}. On the other hand:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}
The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it will 
insert an extra project operator which makes the comparison on canonicalized 
plan during cache lookup fail.

  was:
Currently, caching a permanent view doesn't work in certain cases. For 
instance, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}
The last SELECT query will hit the cached {{v1}}. On the other hand:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}
The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it will 
insert an extra project operator which makes the comparison on canonicalized 
plan during cache lookup fail.


> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a temporary or permanent view doesn't work in certain 
> cases. For instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra project operator which makes the comparison on canonicalized 
> plan during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Summary: Cache lookup doesn't work in certain cases  (was: Caching with 
permanent view doesn't work in certain cases)

> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra project operator which makes the comparison on canonicalized 
> plan during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing

2021-01-27 Thread Chao Sun (Jira)
Chao Sun created SPARK-34271:


 Summary: Use majorMinorPatchVersion for Hive version parsing
 Key: SPARK-34271
 URL: https://issues.apache.org/jira/browse/SPARK-34271
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions. 
Therefore, whenever we upgrade the Hive version we have to remember to update the 
method. It would be better if we just checked the major & minor version.
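
A rough sketch of the idea (the helpers below are illustrative, not the actual 
{{IsolatedClientLoader}} code):
{code:java}
// Sketch only: parse "major.minor.patch" once and dispatch on the major/minor
// pair, so new patch releases such as 2.3.9 don't require touching a version list.
def majorMinorPatch(version: String): Option[(Int, Int, Int)] = {
  val Pattern = """(\d+)\.(\d+)(?:\.(\d+))?.*""".r
  version match {
    case Pattern(major, minor, patch) =>
      Some((major.toInt, minor.toInt, Option(patch).map(_.toInt).getOrElse(0)))
    case _ => None
  }
}

def hiveClientVersion(version: String): String = majorMinorPatch(version) match {
  case Some((2, 3, _)) => "2.3"   // every 2.3.x maps to the same client
  case Some((3, 1, _)) => "3.1"
  case Some((major, minor, _)) => s"$major.$minor"
  case None => sys.error(s"Unrecognized Hive version: $version")
}
{code}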



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2021-01-27 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273161#comment-17273161
 ] 

Chao Sun commented on SPARK-27589:
--

[~xkrogen] FWIW I'm working on a POC for SPARK-32935 at the moment. There is 
also a design doc in progress. Hopefully we'll be able to share it soon. cc 
[~rdblue] too.

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34052) A cached view should become invalid after a table is dropped

2021-01-26 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272333#comment-17272333
 ] 

Chao Sun commented on SPARK-34052:
--

[~hyukjin.kwon] [~cloud_fan] do you think we should include this in 3.1.1? 
Since we've changed how temp views work in SPARK-33142, it may be better to add 
this too to keep things consistent.

> A cached view should become invalid after a table is dropped
> 
>
> Key: SPARK-34052
> URL: https://issues.apache.org/jira/browse/SPARK-34052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> It seems a view doesn't become invalid after a DSv2 table is dropped or 
> replaced. This is different from V1 and may cause correctness issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268063#comment-17268063
 ] 

Chao Sun commented on SPARK-33507:
--

[~aokolnychyi] could you elaborate on the question? Currently Spark doesn't 
support caching streaming tables yet.

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fix existing cache behavior in v1 and v2.
>   - fix inconsistent cache behavior between v1 and v2
>   - implement missing features in v2 to align with those in v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264608#comment-17264608
 ] 

Chao Sun edited comment on SPARK-33507 at 1/14/21, 5:23 AM:


Thanks [~hyukjin.kwon]. From my side, there is no regression, although I feel 
SPARK-34052 is a bit important since it concerns correctness. I'm working on a 
fix but got delayed by a few other issues found along the way :(.

The issue has been there for a long time though, so I'm fine moving this to the 
next release.
 

 


was (Author: csun):
Thanks [~hyukjin.kwon]. From my side, there is no regression. Although I feel 
SPARK-34052 is a bit important since it concerns correctness. I'm working on a 
fix but got delayed by a few other issues found during the process :(

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fix existing cache behavior in v1 and v2.
>   - fix inconsistent cache behavior between v1 and v2
>   - implement missing features in v2 to align with those in v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264608#comment-17264608
 ] 

Chao Sun commented on SPARK-33507:
--

Thanks [~hyukjin.kwon]. From my side, there is no regression. Although I feel 
SPARK-34052 is a bit important since it concerns correctness. I'm working on a 
fix but got delayed by a few other issues found during the process :(

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fix existing cache behavior in v1 and v2.
>   - fix inconsistent cache behavior between v1 and v2
>   - implement missing features in v2 to align with those in v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Summary: Caching with permanent view doesn't work in certain cases  (was: 
Caching doesn't work completely with permanent view)

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra project operator which makes the comparison on canonicalized 
> plan during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Caching doesn't work completely with permanent view

2021-01-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Description: 
Currently, caching a permanent view doesn't work in certain cases. For 
instance, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}
The last SELECT query will hit the cached {{v1}}. On the other hand:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}
The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it will 
insert an extra project operator which makes the comparison on canonicalized 
plan during cache lookup fail.

  was:
Currently, caching a permanent view doesn't work in some cases. For instance, 
in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}

The last SELECT query will hit the cached {{v1}}. However, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}

The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it will 
insert an extra project operator which makes the comparison on canonicalized 
plan during cache lookup fail.


> Caching doesn't work completely with permanent view
> ---
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra project operator which makes the comparison on canonicalized 
> plan during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


