[jira] [Updated] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
[ https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35959: - Priority: Major (was: Blocker) > Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions > - > > Key: SPARK-35959 > URL: https://issues.apache.org/jira/browse/SPARK-35959 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark uses Hadoop shaded client by default. However, if Spark users > want to build Spark with older version of Hadoop, such as 3.1.x, the shaded > client cannot be used (currently it only support Hadoop 3.2.2+ and 3.3.1+). > Therefore, this proposes to offer a new Maven profile "no-shaded-client" for > this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412167#comment-17412167 ] Chao Sun commented on SPARK-36696: -- [This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331] looks suspicious: why column chunk file offset = dictionary/data page offset + compressed size of the column chunk? > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164 ] Chao Sun commented on SPARK-36696: -- This looks like the same issue as in PARQUET-2078. The file offset for the first row group is set to 31173 which causes {{filterFileMetaDataByMidpoint}} to filter out the only row group (range filter is [0, 37968], while startIndex is 31173 and total size is 35820). Seems there is a bug in Apache Arrow which writes incorrect file offset. cc [~gershinsky] to see if you know any info there. > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
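To make the arithmetic in the comment above concrete, here is a rough sketch (not the actual parquet-mr code) of the midpoint check that {{filterFileMetaDataByMidpoint}} performs, plugged with the reported numbers; the smaller "correct" offset used for contrast is a hypothetical value for illustration only.
{code:scala}
// Illustrative sketch of the midpoint-based row-group filter, using the numbers from
// the comment above. Not the real parquet-mr implementation.
object MidpointCheck {
  def keepsRowGroup(splitStart: Long, splitEnd: Long, startIndex: Long, totalSize: Long): Boolean = {
    val midpoint = startIndex + totalSize / 2
    splitStart <= midpoint && midpoint < splitEnd
  }

  def main(args: Array[String]): Unit = {
    // Offset as written by Arrow (dictionary/data page offset + compressed size): row group is dropped.
    println(keepsRowGroup(0L, 37968L, 31173L, 35820L)) // false: midpoint = 49083 > 37968
    // With a plausible real start of the column chunk (hypothetical value), it would be kept.
    println(keepsRowGroup(0L, 37968L, 4L, 35820L))     // true: midpoint = 17914
  }
}
{code}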
[jira] [Created] (SPARK-36695) Allow passing V2 functions to data sources via V2 filters
Chao Sun created SPARK-36695: Summary: Allow passing V2 functions to data sources via V2 filters Key: SPARK-36695 URL: https://issues.apache.org/jira/browse/SPARK-36695 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun The V2 filter API currently only allows {{NamedReference}} in predicates that are pushed down to data sources. It may be beneficial to allow V2 functions in predicates as well so that we can implement function pushdown. This feature is also supported by Trino (Presto). One use case: we could push down predicates such as {{bucket(col, 32) = 10}}, which would allow data sources such as Iceberg to only scan a single partition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
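A purely illustrative model of what such a predicate could carry; these case classes are made up for the example (this is not Spark's actual connector API), REPL-style:
{code:scala}
// Toy model only: the existing V2 filters can reference columns and literals; the new piece
// would be a function call node such as FuncCall below. All class names here are hypothetical.
sealed trait V2Expr
case class NamedRef(field: String) extends V2Expr
case class Lit(value: Any) extends V2Expr
case class FuncCall(name: String, args: Seq[V2Expr]) extends V2Expr // the new capability
case class EqualTo(left: V2Expr, right: V2Expr)

// `bucket(col, 32) = 10` from the description, expressed in this toy model. A source like
// Iceberg could recognize the bucket transform and prune the scan to a single partition.
val pushed = EqualTo(FuncCall("bucket", Seq(NamedRef("col"), Lit(32))), Lit(10))
{code}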
[jira] [Commented] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava
[ https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410726#comment-17410726 ] Chao Sun commented on SPARK-36676: -- Will post a PR soon > Create shaded Hive module and upgrade to higher version of Guava > > > Key: SPARK-36676 > URL: https://issues.apache.org/jira/browse/SPARK-36676 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark is tied with Guava from Hive which is of version 14. This > proposes to create a separate module {{hive-shaded}} which shades > dependencies from Hive and subsequently allows us to upgrade Guava > independently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava
Chao Sun created SPARK-36676: Summary: Create shaded Hive module and upgrade to higher version of Guava Key: SPARK-36676 URL: https://issues.apache.org/jira/browse/SPARK-36676 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently Spark is tied to Guava 14, which it inherits from Hive. This proposes to create a separate module {{hive-shaded}} that shades dependencies from Hive and subsequently allows us to upgrade Guava independently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408324#comment-17408324 ] Chao Sun commented on SPARK-34276: -- I did some study on the code and it seems this will only affect Spark when {{spark.sql.hive.convertMetastoreParquet}} is set to false, as [~nemon] pointed above. By default Spark uses {{filterFileMetaDataByMidpoint}} (see [here|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226]), which is not impacted much by this bug. In the worst case it could cause imbalance when assigning Parquet row groups to Spark tasks but nothing like read error or data loss. > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should > upgrade/revert Parquet. At the same time, we should encourage the whole > community to do the compatibility and performance tests for their production > workloads, including both read and write code paths. > More details: > [https://github.com/apache/spark/pull/26804#issuecomment-768790620] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407554#comment-17407554 ] Chao Sun commented on SPARK-34276: -- [~smilegator] yea seems like Spark will be affected. cc [~gszadovszky] to confirm. [~nemon] is the issue you mentioned the same as PARQUET-2078? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should > upgrade/revert Parquet. At the same time, we should encourage the whole > community to do the compatibility and performance tests for their production > workloads, including both read and write code paths. > More details: > [https://github.com/apache/spark/pull/26804#issuecomment-768790620] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36528: - Description: Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector and then operate on the decoded data. However, it may be more efficient to directly operate on encoded data, for instance, performing filter or aggregation on RLE-encoded data, or performing comparison over dictionary-encoded string data. This can also potentially work with encodings in Parquet v2 format, such as DELTA_BYTE_ARRAY. (was: Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector and then operate on the decoded data. However, it may be more efficient to directly operate on encoded data (e.g., when the data is using RLE encoding). This can also potentially work with encodings in Parquet v2 format, such as DELTA_BYTE_ARRAY.) > Implement lazy decoding for the vectorized Parquet reader > - > > Key: SPARK-36528 > URL: https://issues.apache.org/jira/browse/SPARK-36528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector > and then operate on the decoded data. However, it may be more efficient to > directly operate on encoded data, for instance, performing filter or > aggregation on RLE-encoded data, or performing comparison over > dictionary-encoded string data. This can also potentially work with encodings > in Parquet v2 format, such as DELTA_BYTE_ARRAY. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
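As an illustration of the "operate on encoded data" idea for the dictionary-encoded string case mentioned above, here is a minimal sketch (not Spark's implementation): an equality filter is resolved against the dictionary once, after which only integer dictionary ids are compared, with no string materialization.
{code:scala}
// Illustrative only: evaluate `col = "foo"` on a dictionary-encoded column chunk by looking the
// literal up in the dictionary once, then matching per-row dictionary ids.
def filterDictEncoded(ids: Array[Int], dictionary: Array[String], target: String): Array[Boolean] = {
  val targetId = dictionary.indexOf(target) // -1 if the value never appears in this column chunk
  ids.map(_ == targetId)
}

// Example: 3 distinct values in the dictionary; the column chunk stores only their ids.
val dict = Array("bar", "foo", "baz")
val ids  = Array(0, 1, 1, 2, 0)
println(filterDictEncoded(ids, dict, "foo").mkString(", ")) // false, true, true, false, false
{code}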
[jira] [Updated] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36528: - Description: Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector and then operate on the decoded data. However, it may be more efficient to directly operate on encoded data (e.g., when the data is using RLE encoding). This can also potentially work with encodings in Parquet v2 format, such as DELTA_BYTE_ARRAY. (was: Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector and then operate on the decoded data. However, it may be more efficient to directly operate on encoded data (e.g., when the data is using RLE encoding).) > Implement lazy decoding for the vectorized Parquet reader > - > > Key: SPARK-36528 > URL: https://issues.apache.org/jira/browse/SPARK-36528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark first decode (e.g., RLE/bit-packed, PLAIN) into column vector > and then operate on the decoded data. However, it may be more efficient to > directly operate on encoded data (e.g., when the data is using RLE encoding). > This can also potentially work with encodings in Parquet v2 format, such as > DELTA_BYTE_ARRAY. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36527: - Description: At the moment the Parquet vectorized reader will eagerly decode all the columns that are in the read schema, before any filter has been applied to them. This is costly. Instead it's better to only materialize these column vectors when the data are actually needed. (was: At the moment the Parquet vectorized reader will eagerly decode all the columns that are in the read schema, before any filter has been applied to them. This is costly. Instead it's better to only materialize these column vectors when the data are actually read.) > Implement lazy materialization for the vectorized Parquet reader > > > Key: SPARK-36527 > URL: https://issues.apache.org/jira/browse/SPARK-36527 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > At the moment the Parquet vectorized reader will eagerly decode all the > columns that are in the read schema, before any filter has been applied to > them. This is costly. Instead it's better to only materialize these column > vectors when the data are actually needed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
Chao Sun created SPARK-36529: Summary: Decouple CPU with IO work in vectorized Parquet reader Key: SPARK-36529 URL: https://issues.apache.org/jira/browse/SPARK-36529 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
1. read the row group using the file system API (perhaps from remote storage like S3)
2. allocate buffers and store those row group bytes into them
3. decompress the data pages
4. in Spark, decode all the read columns one by one
5. read the next row group and repeat from 1.
A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, and utilize all the cores available for a Spark task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
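A rough sketch of the pipelining idea only (not Spark's reader): fetch the next row group's bytes on a background thread while the CPU decodes the current one. The {{loadRowGroupBytes}} and {{decodeColumns}} functions are placeholders standing in for steps 1-3 and step 4 above.
{code:scala}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Overlap IO for row group i+1 with CPU decoding of row group i.
def readPipelined(rowGroups: Seq[Int],
                  loadRowGroupBytes: Int => Array[Byte],
                  decodeColumns: Array[Byte] => Unit)
                 (implicit ec: ExecutionContext): Unit = {
  var pending: Option[Future[Array[Byte]]] =
    rowGroups.headOption.map(rg => Future(loadRowGroupBytes(rg)))
  rowGroups.indices.foreach { i =>
    val bytes = Await.result(pending.get, Duration.Inf)   // IO for row group i (already in flight)
    pending = rowGroups.lift(i + 1).map(rg => Future(loadRowGroupBytes(rg))) // kick off IO for i+1
    decodeColumns(bytes)                                   // CPU work overlaps with that IO
  }
}
{code}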
[jira] [Created] (SPARK-36528) Implement lazy decoding for the vectorized Parquet reader
Chao Sun created SPARK-36528: Summary: Implement lazy decoding for the vectorized Parquet reader Key: SPARK-36528 URL: https://issues.apache.org/jira/browse/SPARK-36528 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently Spark first decodes Parquet data (e.g., RLE/bit-packed, PLAIN) into column vectors and then operates on the decoded data. However, it may be more efficient to directly operate on encoded data (e.g., when the data is using RLE encoding). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36527) Implement lazy materialization for the vectorized Parquet reader
Chao Sun created SPARK-36527: Summary: Implement lazy materialization for the vectorized Parquet reader Key: SPARK-36527 URL: https://issues.apache.org/jira/browse/SPARK-36527 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun At the moment the Parquet vectorized reader will eagerly decode all the columns that are in the read schema, before any filter has been applied to them. This is costly. Instead it's better to only materialize these column vectors when the data are actually read. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
Chao Sun created SPARK-36511: Summary: Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13 Key: SPARK-36511 URL: https://issues.apache.org/jira/browse/SPARK-36511 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun {{ColumnIO}} doesn't expose methods to get repetition and definition level so Spark has to use a hack. This should be removed once PARQUET-2050 is released. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-34861) Support nested column in Spark vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34861: - Comment: was deleted (was: Synced with [~chengsu] offline and I will take over this JIRA.) > Support nested column in Spark vectorized readers > - > > Key: SPARK-34861 > URL: https://issues.apache.org/jira/browse/SPARK-34861 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > This is the umbrella task to track the overall progress. The task is to > support nested column type in Spark vectorized reader, namely Parquet and > ORC. Currently both Parquet and ORC vectorized readers do not support nested > column type (struct, array and map). We implemented nested column vectorized > reader for FB-ORC in our internal fork of Spark. We are seeing performance > improvement compared to non-vectorized reader when reading nested columns. In > addition, this can also help improve the non-nested column performance when > reading non-nested and nested columns together in one query. > > Parquet: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173] > > ORC: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36440) Spark3 fails to read hive table with mixed format
[ https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394529#comment-17394529 ] Chao Sun commented on SPARK-36440: -- Hmm really? Spark 2.x support this? I'm not sure why Spark is still expected to work in this case since the serde is changed to Parquet but the underlying data file is in ORC. It seems like an error that users should avoid. > Spark3 fails to read hive table with mixed format > - > > Key: SPARK-36440 > URL: https://issues.apache.org/jira/browse/SPARK-36440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.1.2 >Reporter: Jason Xu >Priority: Major > > Spark3 fails to read hive table with mixed format with hive Serde, this is a > regression compares to Spark 2.4. > Replication steps : > 1. In spark 3 (3.0 or 3.1) spark shell: > {code:java} > scala> spark.sql("create table tmp.test_table (id int, name string) > partitioned by (pt int) stored as rcfile") > scala> spark.sql("insert into tmp.test_table (pt = 1) values (1, 'Alice'), > (2, 'Bob')") > {code} > 2. Run hive command to change table file format (from RCFile to Parquet). > {code:java} > hive (default)> alter table set tmp.test_table fileformat Parquet; > {code} > 3. Try to read partition (in RCFile format) with hive serde using Spark shell: > {code:java} > scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false") > scala> spark.sql("select * from tmp.test_table where pt=1").show{code} > Exception: (anonymized file path with ) > {code:java} > Caused by: java.lang.RuntimeException: > s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not > a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, > 96, 1, -33] > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433) > at > org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79) > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:75) > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60) > at > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75) > at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:285) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36317) PruneFileSourcePartitionsSuite tests are failing after the fix to SPARK-36136
[ https://issues.apache.org/jira/browse/SPARK-36317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388204#comment-17388204 ] Chao Sun commented on SPARK-36317: -- [~vsowrirajan]: the change is already reverted - are you still seeing the test failures? > PruneFileSourcePartitionsSuite tests are failing after the fix to SPARK-36136 > - > > Key: SPARK-36317 > URL: https://issues.apache.org/jira/browse/SPARK-36317 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > After the fix to [SPARK-36136][SQL][TESTS] Refactor > PruneFileSourcePartitionsSuite etc to a different package, couple of tests in > PruneFileSourcePartitionsSuite are failing now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS
[ https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36137: - Description: At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the remote HMS. However, in certain cases the remote HMS will fallback to use ORM (which only support string type for partition columns) to query the underlying RDBMS even if this config is set to true, and Spark will not be able to recover from the error and will just fail the query. For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, and Spark was not able to pushdown filter for {{date}} column. was: At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the remote HMS. However, in certain cases the remote HMS will fallback to use ORM to query the underlying RDBMS even if this config is set to true, and Spark will not be able to recover from the error and will just fail the query. For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, and Spark was not able to pushdown filter for {{date}} column. > HiveShim always fallback to getAllPartitionsOf regardless of whether > directSQL is enabled in remote HMS > --- > > Key: SPARK-36137 > URL: https://issues.apache.org/jira/browse/SPARK-36137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use > {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in > the remote HMS. However, in certain cases the remote HMS will fallback to use > ORM (which only support string type for partition columns) to query the > underlying RDBMS even if this config is set to true, and Spark will not be > able to recover from the error and will just fail the query. > For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, > and Spark was not able to pushdown filter for {{date}} column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS
Chao Sun created SPARK-36137: Summary: HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS Key: SPARK-36137 URL: https://issues.apache.org/jira/browse/SPARK-36137 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun At the moment {{getPartitionsByFilter}} in Hive shim only fallback to use {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in the remote HMS. However, in certain cases the remote HMS will fallback to use ORM to query the underlying RDBMS even if this config is set to true, and Spark will not be able to recover from the error and will just fail the query. For instance, we encountered this bug HIVE-21497 in HMS running Hive 3.1.2, and Spark was not able to pushdown filter for {{date}} column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36136) Move PruneFileSourcePartitionsSuite out of org.apache.spark.sql.hive
Chao Sun created SPARK-36136: Summary: Move PruneFileSourcePartitionsSuite out of org.apache.spark.sql.hive Key: SPARK-36136 URL: https://issues.apache.org/jira/browse/SPARK-36136 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently both {{PruneFileSourcePartitionsSuite}} and {{PrunePartitionSuiteBase}} are in {{org.apache.spark.sql.hive.execution}} which doesn't look right. They should belong to {{org.apache.spark.sql.execution.datasources}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380326#comment-17380326 ] Chao Sun commented on SPARK-36128: -- Thanks, I'm slightly inclined to reuse the existing config but document the new behavior (e.g., it is used for data source tables too when {{spark.sql.hive.manageFilesourcePartitions}} is set). Let me raise a PR for this and we can discuss there. > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36131) Refactor ParquetColumnIndexSuite
Chao Sun created SPARK-36131: Summary: Refactor ParquetColumnIndexSuite Key: SPARK-36131 URL: https://issues.apache.org/jira/browse/SPARK-36131 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun This is a minor refactoring on {{ParquetColumnIndexSuite}} to allow better code reuse. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380299#comment-17380299 ] Chao Sun commented on SPARK-36128: -- [~hyukjin.kwon] you are right - I didn't know this config is designed to be only used by Hive table scan, but this poses a few issues: 1. by default, data source tables also manage their partitions through HMS, via config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says "When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning", so it sounds like they should have the same partition pruning mechanism as Hive tables. 2. there is effectively only one implementation for {{ExternalCatalog}} which is HMS, so I'm not sure why we treat Hive table scans differently than data source table scans, even though both of them are reading partition metadata from HMS. > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
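For reference, the two configs discussed in this thread, as they would be set in spark-shell; the values shown are believed to be the defaults in recent Spark releases and are worth double-checking per version.
{code:scala}
// Data source tables keep their partition metadata in the Hive metastore.
spark.conf.set("spark.sql.hive.manageFilesourcePartitions", "true")
// Push partition predicates down to the metastore when listing partitions.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
{code}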
[jira] [Comment Edited] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380299#comment-17380299 ] Chao Sun edited comment on SPARK-36128 at 7/14/21, 4:24 AM: [~hyukjin.kwon] you are right - I didn't know this config is designed to be only used by Hive table scan, but this poses a few issues: 1. by default, data source tables also manage their partitions through HMS, via config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says "When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning", so it sounds like they should have the same partition pruning mechanism as Hive tables, including the flag. 2. there is effectively only one implementation for {{ExternalCatalog}} which is HMS, so I'm not sure why we treat Hive table scans differently than data source table scans, even though both of them are reading partition metadata from HMS. was (Author: csun): [~hyukjin.kwon] you are right - I didn't know this config is designed to be only used by Hive table scan, but this poses a few issues: 1. by default, data source tables also manage their partitions through HMS, via config {{spark.sql.hive.manageFilesourcePartitions}}. This config also says "When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning", so it sounds like they should have the same partition pruning mechanism as Hive tables. 2. there is effectively only one implementation for {{ExternalCatalog}} which is HMS, so I'm not sure why we treat Hive table scans differently than data source table scans, even though both of them are reading partition metadata from HMS. > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
Chao Sun created SPARK-36128: Summary: CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning Key: SPARK-36128 URL: https://issues.apache.org/jira/browse/SPARK-36128 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. The latter calls {{CatalogFileIndex.filterPartitions}} which calls {{listPartitionsByFilter}} regardless of whether the above config is set or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36123) Parquet vectorized reader doesn't skip null values correctly
[ https://issues.apache.org/jira/browse/SPARK-36123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36123: - Labels: correctness (was: ) > Parquet vectorized reader doesn't skip null values correctly > > > Key: SPARK-36123 > URL: https://issues.apache.org/jira/browse/SPARK-36123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Blocker > Labels: correctness > > Currently when Parquet column index is effective, the vectorized reader > doesn't skip null values properly which will cause incorrect result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36123) Parquet vectorized reader doesn't skip null values correctly
Chao Sun created SPARK-36123: Summary: Parquet vectorized reader doesn't skip null values correctly Key: SPARK-36123 URL: https://issues.apache.org/jira/browse/SPARK-36123 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently when Parquet column index is effective, the vectorized reader doesn't skip null values properly which will cause incorrect result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35743) Improve Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-35743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35743: - Labels: parquet (was: ) > Improve Parquet vectorized reader > - > > Key: SPARK-35743 > URL: https://issues.apache.org/jira/browse/SPARK-35743 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > Labels: parquet > > This umbrella JIRA tracks efforts to improve vectorized Parquet reader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36056) Combine readBatch and readIntegers in VectorizedRleValuesReader
Chao Sun created SPARK-36056: Summary: Combine readBatch and readIntegers in VectorizedRleValuesReader Key: SPARK-36056 URL: https://issues.apache.org/jira/browse/SPARK-36056 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun {{readBatch}} and {{readIntegers}} share similar code path and this Jira aims to combine them into one method for easier maintenance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
Chao Sun created SPARK-35959: Summary: Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions Key: SPARK-35959 URL: https://issues.apache.org/jira/browse/SPARK-35959 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Chao Sun Currently Spark uses the Hadoop shaded client by default. However, if Spark users want to build Spark with an older version of Hadoop, such as 3.1.x, the shaded client cannot be used (currently it only supports Hadoop 3.2.2+ and 3.3.1+). Therefore, this proposes to offer a new Maven profile "no-shaded-client" for this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
Chao Sun created SPARK-35867: Summary: Enable vectorized read for VectorizedPlainValuesReader.readBooleans Key: SPARK-35867 URL: https://issues.apache.org/jira/browse/SPARK-35867 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently we decode PLAIN-encoded booleans as follows:
{code:java}
public final void readBooleans(int total, WritableColumnVector c, int rowId) {
  // TODO: properly vectorize this
  for (int i = 0; i < total; i++) {
    c.putBoolean(rowId + i, readBoolean());
  }
}
{code}
Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
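A sketch of the vectorization idea only; the real reader is Java and writes into a {{WritableColumnVector}}, whereas this standalone version just returns an array. It relies on the Parquet PLAIN encoding for booleans being bit-packed, one bit per value, least-significant bit first, so one byte can yield up to 8 values per load instead of one {{readBoolean()}} call per value.
{code:scala}
def readBooleans(total: Int, bytes: Array[Byte]): Array[Boolean] = {
  val out = new Array[Boolean](total)
  var i = 0
  while (i < total) {
    val b = bytes(i >> 3) & 0xFF
    var bit = i & 7 // normally 0 here; kept general for non-byte-aligned starts
    while (bit < 8 && i < total) {
      out(i) = ((b >> bit) & 1) != 0
      i += 1
      bit += 1
    }
  }
  out
}

// Example: 0x0D = 0b00001101 encodes true, false, true, true, ... (LSB first).
println(readBooleans(4, Array(0x0D.toByte)).mkString(", ")) // true, false, true, true
{code}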
[jira] [Updated] (SPARK-35846) Introduce ParquetReadState to track various states while reading a Parquet column chunk
[ https://issues.apache.org/jira/browse/SPARK-35846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35846: - Description: This is mostly refactoring work to complete SPARK-34859 > Introduce ParquetReadState to track various states while reading a Parquet > column chunk > --- > > Key: SPARK-35846 > URL: https://issues.apache.org/jira/browse/SPARK-35846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > This is mostly refactoring work to complete SPARK-34859 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35846) Introduce ParquetReadState to track various states while reading a Parquet column chunk
Chao Sun created SPARK-35846: Summary: Introduce ParquetReadState to track various states while reading a Parquet column chunk Key: SPARK-35846 URL: https://issues.apache.org/jira/browse/SPARK-35846 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
[ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35640: - Parent: SPARK-35743 Issue Type: Sub-task (was: Improvement) > Refactor Parquet vectorized reader to remove duplicated code paths > -- > > Key: SPARK-35640 > URL: https://issues.apache.org/jira/browse/SPARK-35640 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently in Parquet vectorized code path, there are many code duplications > such as the following: > {code:java} > public void readIntegers( > int total, > WritableColumnVector c, > int rowId, > int level, > VectorizedValuesReader data) throws IOException { > int left = total; > while (left > 0) { > if (this.currentCount == 0) this.readNextGroup(); > int n = Math.min(left, this.currentCount); > switch (mode) { > case RLE: > if (currentValue == level) { > data.readIntegers(n, c, rowId); > } else { > c.putNulls(rowId, n); > } > break; > case PACKED: > for (int i = 0; i < n; ++i) { > if (currentBuffer[currentBufferIdx++] == level) { > c.putInt(rowId + i, data.readInteger()); > } else { > c.putNull(rowId + i); > } > } > break; > } > rowId += n; > left -= n; > currentCount -= n; > } > } > {code} > This makes it hard to maintain as any change on this will need to be > replicated in 20+ places. The issue becomes more serious when we are going to > implement column index and complex type support for the vectorized path. > The original intention is for performance. However now days JIT compilers > tend to be smart on this and will inline virtual calls as much as possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35743) Improve Parquet vectorized reader
Chao Sun created SPARK-35743: Summary: Improve Parquet vectorized reader Key: SPARK-35743 URL: https://issues.apache.org/jira/browse/SPARK-35743 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun This umbrella JIRA tracks efforts to improve vectorized Parquet reader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34861) Support nested column in Spark vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361090#comment-17361090 ] Chao Sun commented on SPARK-34861: -- Synced with [~chengsu] offline and I will take over this JIRA. > Support nested column in Spark vectorized readers > - > > Key: SPARK-34861 > URL: https://issues.apache.org/jira/browse/SPARK-34861 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > This is the umbrella task to track the overall progress. The task is to > support nested column type in Spark vectorized reader, namely Parquet and > ORC. Currently both Parquet and ORC vectorized readers do not support nested > column type (struct, array and map). We implemented nested column vectorized > reader for FB-ORC in our internal fork of Spark. We are seeing performance > improvement compared to non-vectorized reader when reading nested columns. In > addition, this can also help improve the non-nested column performance when > reading non-nested and nested columns together in one query. > > Parquet: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173] > > ORC: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35703) Remove HashClusteredDistribution
[ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35703: - Description: Currently Spark has {{HashClusteredDistribution}} and {{ClusteredDistribution}}. The only difference between the two is that the former is more strict when deciding whether bucket join is allowed to avoid shuffle: comparing to the latter, it requires *exact* match between the clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and the join keys. However, this is unnecessary, as we should be able to avoid shuffle when the set of clustering keys is a subset of join keys, just like {{ClusteredDistribution}}. (was: Currently Spark has {{HashClusteredDistribution}} and {{ClusteredDistribution}}. The only difference between the two is that the former is more strict when deciding whether bucket join is allowed to avoid shuffle: comparing to the latter, it requires *exact* match between the clustering keys from the output partitioning and the join keys. However, this is unnecessary, as we should be able to avoid shuffle when the set of clustering keys is a subset of join keys, just like {{ClusteredDistribution}}. ) > Remove HashClusteredDistribution > > > Key: SPARK-35703 > URL: https://issues.apache.org/jira/browse/SPARK-35703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark has {{HashClusteredDistribution}} and > {{ClusteredDistribution}}. The only difference between the two is that the > former is more strict when deciding whether bucket join is allowed to avoid > shuffle: comparing to the latter, it requires *exact* match between the > clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and > the join keys. However, this is unnecessary, as we should be able to avoid > shuffle when the set of clustering keys is a subset of join keys, just like > {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
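A hypothetical illustration of the difference (table and column names are made up): suppose {{t1}} and {{t2}} are both bucketed by column {{a}} only and are joined on {{(a, b)}}. Under {{HashClusteredDistribution}} the join keys must match the bucketing keys exactly, so both sides get shuffled; with {{ClusteredDistribution}}-style matching, the existing bucketing on the subset {{a}} could be reused and the shuffle avoided.
{code:scala}
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// Both t1 and t2 are assumed to be bucketed by `a`.
val joined = spark.table("t1").join(spark.table("t2"), Seq("a", "b"))
joined.explain() // look for an Exchange (or its absence) above the join
{code}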
[jira] [Created] (SPARK-35703) Remove HashClusteredDistribution
Chao Sun created SPARK-35703: Summary: Remove HashClusteredDistribution Key: SPARK-35703 URL: https://issues.apache.org/jira/browse/SPARK-35703 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently Spark has {{HashClusteredDistribution}} and {{ClusteredDistribution}}. The only difference between the two is that the former is more strict when deciding whether bucket join is allowed to avoid shuffle: comparing to the latter, it requires *exact* match between the clustering keys from the output partitioning and the join keys. However, this is unnecessary, as we should be able to avoid shuffle when the set of clustering keys is a subset of join keys, just like {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
Chao Sun created SPARK-35640: Summary: Refactor Parquet vectorized reader to remove duplicated code paths Key: SPARK-35640 URL: https://issues.apache.org/jira/browse/SPARK-35640 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently in the Parquet vectorized code path, there are many code duplications such as the following:
{code:java}
public void readIntegers(
    int total,
    WritableColumnVector c,
    int rowId,
    int level,
    VectorizedValuesReader data) throws IOException {
  int left = total;
  while (left > 0) {
    if (this.currentCount == 0) this.readNextGroup();
    int n = Math.min(left, this.currentCount);
    switch (mode) {
      case RLE:
        if (currentValue == level) {
          data.readIntegers(n, c, rowId);
        } else {
          c.putNulls(rowId, n);
        }
        break;
      case PACKED:
        for (int i = 0; i < n; ++i) {
          if (currentBuffer[currentBufferIdx++] == level) {
            c.putInt(rowId + i, data.readInteger());
          } else {
            c.putNull(rowId + i);
          }
        }
        break;
    }
    rowId += n;
    left -= n;
    currentCount -= n;
  }
}
{code}
This makes it hard to maintain, as any change to this will need to be replicated in 20+ places. The issue becomes more serious when we are going to implement column index and complex type support for the vectorized path. The original intention was performance. However, nowadays JIT compilers tend to be smart about this and will inline virtual calls as much as possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34859: - Priority: Critical (was: Major) > Vectorized parquet reader needs synchronization among pages for column index > > > Key: SPARK-34859 > URL: https://issues.apache.org/jira/browse/SPARK-34859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Li Xian >Priority: Critical > Labels: correctness > Attachments: > part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet > > > the current implementation has a problem. the pages returned by > `readNextFilteredRowGroup` may not be aligned, some columns may have more > rows than others. > Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` > with `rowIndexes` to make sure that rows are aligned. > Currently `VectorizedParquetRecordReader` doesn't have such synchronizing > among pages from different columns. Using `readNextFilteredRowGroup` may > result in incorrect result. > > I have attache an example parquet file. This file is generated with > `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this > file is listed below. > row group 0 > > _1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED > [more]... > _2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 > ENC:PLAIN,BIT_PACKED [more]... > _1 TV=2000 RL=0 DL=0 > > > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > _2 TV=2000 RL=0 DL=0 > > > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > > As you can see in the row group 0, column1 has 4 data pages each with 500 > values and column 2 has 2 data pages with 1000 values each. > If we want to filter the rows by values with _1 = 510 using columnindex, > parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 > of column _1 starts with row 500, and page 0 of column _2 starts with row 0, > and it will be incorrect if we simply read the two values as one row. > > As an example, If you try filter with _1 = 510 with column index on in > current version, it will give you the wrong result > +---+---+ > |_1 |_2 | > +---+---+ > |510|10 | > +---+---+ > And if turn columnindex off, you can get the correct result > +---+---+ > |_1 |_2 | > +---+---+ > |510|510| > +---+---+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index
[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34859: - Labels: correctness (was: ) > Vectorized parquet reader needs synchronization among pages for column index > > > Key: SPARK-34859 > URL: https://issues.apache.org/jira/browse/SPARK-34859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Li Xian >Priority: Major > Labels: correctness > Attachments: > part-0-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet > > > the current implementation has a problem. the pages returned by > `readNextFilteredRowGroup` may not be aligned, some columns may have more > rows than others. > Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` > with `rowIndexes` to make sure that rows are aligned. > Currently `VectorizedParquetRecordReader` doesn't have such synchronizing > among pages from different columns. Using `readNextFilteredRowGroup` may > result in incorrect result. > > I have attache an example parquet file. This file is generated with > `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this > file is listed below. > row group 0 > > _1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED > [more]... > _2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 > ENC:PLAIN,BIT_PACKED [more]... > _1 TV=2000 RL=0 DL=0 > > > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > _2 TV=2000 RL=0 DL=0 > > > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > > As you can see in the row group 0, column1 has 4 data pages each with 500 > values and column 2 has 2 data pages with 1000 values each. > If we want to filter the rows by values with _1 = 510 using columnindex, > parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 > of column _1 starts with row 500, and page 0 of column _2 starts with row 0, > and it will be incorrect if we simply read the two values as one row. > > As an example, If you try filter with _1 = 510 with column index on in > current version, it will give you the wrong result > +---+---+ > |_1 |_2 | > +---+---+ > |510|10 | > +---+---+ > And if turn columnindex off, you can get the correct result > +---+---+ > |_1 |_2 | > +---+---+ > |510|510| > +---+---+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35461) Error when reading dictionary-encoded Parquet int column when read schema is bigint
[ https://issues.apache.org/jira/browse/SPARK-35461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348667#comment-17348667 ] Chao Sun commented on SPARK-35461: -- Actually this also fails when turning off the vectorized reader: {code} Caused by: java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to class org.apache.spark.sql.catalyst.expressions.MutableInt (org.apache.spark.sql.catalyst.expressions.MutableLong and org.apache.spark.sql.catalyst.expressions.MutableInt are in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setInt(SpecificInternalRow.scala:253) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:178) at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:88) at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:297) at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440) at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30) at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229) {code} In this case parquet-mr is able to return the value but Spark won't be able to handle it. > Error when reading dictionary-encoded Parquet int column when read schema is > bigint > --- > > Key: SPARK-35461 > URL: https://issues.apache.org/jira/browse/SPARK-35461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Chao Sun >Priority: Major > > When reading a dictionary-encoded integer column from a Parquet file, and > users specify read schema to be bigint, Spark currently will fail with the > following exception: > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary > at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344) > {code} > To reproduce: > {code} > val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, > i.toString)) > withParquetFile(data) { path => > val readSchema = StructType(Seq(StructField("_1", LongType))) > spark.read.schema(readSchema).parquet(path).first() > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35461) Error when reading dictionary-encoded Parquet int column when read schema is bigint
Chao Sun created SPARK-35461: Summary: Error when reading dictionary-encoded Parquet int column when read schema is bigint Key: SPARK-35461 URL: https://issues.apache.org/jira/browse/SPARK-35461 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.2 Reporter: Chao Sun When reading a dictionary-encoded integer column from a Parquet file, and users specify read schema to be bigint, Spark currently will fail with the following exception: {code} java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49) at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344) {code} To reproduce: {code} val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString)) withParquetFile(data) { path => val readSchema = StructType(Seq(StructField("_1", LongType))) spark.read.schema(readSchema).parquet(path).first() } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
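Editor's note: one possible direction for SPARK-35461 (an assumption on my part, not the actual patch) is to widen dictionary values at decode time. A minimal Scala sketch against Parquet's public {{Dictionary}} class; where such a wrapper would be plugged in inside Spark's {{ParquetDictionary}} is left open here.
{code:scala}
import org.apache.parquet.column.Dictionary

// Wraps an int-valued Parquet dictionary and answers decodeToLong by widening,
// which is roughly what a fix could do when the read schema asks for bigint
// (sketch only; the placement of this logic is an assumption).
class WideningIntDictionary(delegate: Dictionary) extends Dictionary(delegate.getEncoding) {
  override def getMaxId: Int = delegate.getMaxId
  override def decodeToInt(id: Int): Int = delegate.decodeToInt(id)
  override def decodeToLong(id: Int): Long = delegate.decodeToInt(id).toLong // int -> long widening
}
{code}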
[jira] [Commented] (SPARK-35422) Many test cases failed in Scala 2.13 CI
[ https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346448#comment-17346448 ] Chao Sun commented on SPARK-35422: -- Thanks [~dongjoon]. I've opened a PR for the above failures: https://github.com/apache/spark/pull/32575 > Many test cases failed in Scala 2.13 CI > --- > > Key: SPARK-35422 > URL: https://issues.apache.org/jira/browse/SPARK-35422 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Major > > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/] > > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > > [org.apache.spark.sql.SQLQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/]|2.4 > > sec|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified > (tpcds-modifiedQueries/q46)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q46_/]|59 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified > (tpcds-modifiedQueries/q53)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q53_/]|62 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified > (tpcds-modifiedQueries/q63)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q63_/]|54 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified > (tpcds-modifiedQueries/q68)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q68_/]|50 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! 
> [org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified > (tpcds-modifiedQueries/q73)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilitySuite/check_simplified__tpcds_modifiedQueries_q73_/]|58 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check > simplified sf100 > (tpcds-modifiedQueries/q46)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilityWithStatsSuite/check_simplified_sf100__tpcds_modifiedQueries_q46_/]|62 > > ms|[2|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/1179/]| > |!https://amplab.cs.berkeley.edu/jenkins/static/ab03c134/images/16x16/document_add.png! > [org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check > simplified sf100 > (tpcds-modifiedQueries/q53)|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/org.apache.spark.sql/TPCDSModifiedPlanStabilityWithStatsSuite/check_simplified_sf100__tpcds_modified
[jira] [Created] (SPARK-35390) Handle type coercion when resolving V2 functions
Chao Sun created SPARK-35390: Summary: Handle type coercion when resolving V2 functions Key: SPARK-35390 URL: https://issues.apache.org/jira/browse/SPARK-35390 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun When resolving V2 functions, we should handle type coercion by checking the expected argument types from the UDF function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
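Editor's note: a standalone sketch of the type-coercion idea in SPARK-35390 above. It only compares the declared parameter types of a bound function with the actual argument types and records where a cast would be needed; the helper names are illustrative and this is not Spark's analyzer rule.
{code:scala}
import org.apache.spark.sql.types._

object V2FunctionCoercionSketch {
  // For each argument position, None means the types already line up;
  // Some(t) means the analyzer would wrap the argument in a cast to t.
  def requiredCasts(expected: Seq[DataType], actual: Seq[DataType]): Seq[Option[DataType]] = {
    require(expected.length == actual.length, "arity mismatch")
    expected.zip(actual).map {
      case (want, got) if want == got => None
      case (want, _)                  => Some(want)
    }
  }

  def main(args: Array[String]): Unit = {
    // e.g. a UDF declared over (LongType, DoubleType) called with (IntegerType, DoubleType)
    println(requiredCasts(Seq(LongType, DoubleType), Seq(IntegerType, DoubleType)))
  }
}
{code}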
[jira] [Created] (SPARK-35389) Analyzer should set propagateNull to false for magic function invocation
Chao Sun created SPARK-35389: Summary: Analyzer should set propagateNull to false for magic function invocation Key: SPARK-35389 URL: https://issues.apache.org/jira/browse/SPARK-35389 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun For both {{Invoke}} and {{StaticInvoke}} used by the magic method of {{ScalarFunction}}, we should set {{propagateNull}} to false, so that null values will be passed to the UDF for evaluation, instead of bypassing that and directly returning null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
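Editor's note: a toy model of the flag discussed in SPARK-35389 above (not Catalyst's {{Invoke}}): with propagateNull = true a single null argument short-circuits the call to null, with propagateNull = false the null is handed to the function so the UDF itself decides.
{code:scala}
object PropagateNullSketch {
  def invoke(f: Seq[Any] => Any, args: Seq[Any], propagateNull: Boolean): Any =
    if (propagateNull && args.contains(null)) null else f(args)

  def main(argv: Array[String]): Unit = {
    val describe: Seq[Any] => Any = xs => xs.map(x => if (x == null) "NULL" else x).mkString(",")
    println(invoke(describe, Seq(1, null), propagateNull = true))   // null
    println(invoke(describe, Seq(1, null), propagateNull = false))  // 1,NULL
  }
}
{code}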
[jira] [Updated] (SPARK-35384) Improve performance for InvokeLike.invoke
[ https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35384: - Issue Type: Improvement (was: Bug) > Improve performance for InvokeLike.invoke > - > > Key: SPARK-35384 > URL: https://issues.apache.org/jira/browse/SPARK-35384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > `InvokeLike.invoke` uses `map` to evaluate arguments: > {code:java} > val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) > if (needNullCheck && args.exists(_ == null)) { > // return null if one of arguments is null > null > } else { > {code} > which seems pretty expensive if the method itself is trivial. We can change > it to a plain for-loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35384) Improve performance for InvokeLike.invoke
Chao Sun created SPARK-35384: Summary: Improve performance for InvokeLike.invoke Key: SPARK-35384 URL: https://issues.apache.org/jira/browse/SPARK-35384 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun `InvokeLike.invoke` uses `map` to evaluate arguments: {code:java} val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) if (needNullCheck && args.exists(_ == null)) { // return null if one of arguments is null null } else { {code} which seems pretty expensive if the method itself is trivial. We can change it to a plain for-loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
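Editor's note: a minimal standalone illustration of the rewrite proposed in SPARK-35384 above (toy types, not the actual {{InvokeLike}} code): replace {{map}} + {{exists}} with one loop that evaluates arguments and checks for null in a single pass, with no intermediate collection and an early exit.
{code:scala}
object InvokeArgsSketch {
  // stand-in for Expression.eval(input); in Catalyst this would be e.eval(input)
  type Eval = AnyRef => AnyRef

  // original style: two passes and an intermediate collection
  def evalArgsWithMap(args: Seq[Eval], input: AnyRef): Array[AnyRef] = {
    val evaluated = args.map(e => e(input)).toArray
    if (evaluated.exists(_ == null)) null else evaluated
  }

  // proposed style: one pass, early exit on the first null (the needNullCheck branch)
  def evalArgsWithLoop(args: Array[Eval], input: AnyRef): Array[AnyRef] = {
    val out = new Array[AnyRef](args.length)
    var i = 0
    while (i < args.length) {
      val v = args(i)(input)
      if (v == null) return null
      out(i) = v
      i += 1
    }
    out
  }

  def main(argv: Array[String]): Unit = {
    val fns: Array[Eval] = Array(x => x, _ => "const")
    println(evalArgsWithLoop(fns, "in").mkString(", "))
  }
}
{code}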
[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression
[ https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35361: - Priority: Minor (was: Major) > Improve performance for ApplyFunctionExpression > --- > > Key: SPARK-35361 > URL: https://issues.apache.org/jira/browse/SPARK-35361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could > incur significant runtime cost with `zipWithIndex` call. This proposes to > move the call outside the loop over each input row. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression
[ https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35361: - Priority: Major (was: Minor) > Improve performance for ApplyFunctionExpression > --- > > Key: SPARK-35361 > URL: https://issues.apache.org/jira/browse/SPARK-35361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could > incur significant runtime cost with `zipWithIndex` call. This proposes to > move the call outside the loop over each input row. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35361) Improve performance for ApplyFunctionExpression
Chao Sun created SPARK-35361: Summary: Improve performance for ApplyFunctionExpression Key: SPARK-35361 URL: https://issues.apache.org/jira/browse/SPARK-35361 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could incur significant runtime cost with `zipWithIndex` call. This proposes to move the call outside the loop over each input row. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
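Editor's note: a standalone sketch of the SPARK-35361 change above (toy types, not Catalyst's {{ApplyFunctionExpression}}): build the (child, ordinal) pairs once per expression instead of once per row, so a trivial {{ScalarFunction}} does not pay for a fresh zipped collection on every input row.
{code:scala}
object ZipWithIndexHoistSketch {
  final case class Child(eval: Long => Any)

  // before: zipWithIndex runs inside the per-row path
  def evalPerRow(children: Seq[Child], row: Long): Array[Any] = {
    val out = new Array[Any](children.length)
    children.zipWithIndex.foreach { case (c, i) => out(i) = c.eval(row) }
    out
  }

  // after: the pairs are computed once and reused for every row
  final class Hoisted(children: Seq[Child]) {
    private[this] val indexed = children.zipWithIndex.toArray // built once, outside the row loop
    def eval(row: Long): Array[Any] = {
      val out = new Array[Any](indexed.length)
      indexed.foreach { case (c, i) => out(i) = c.eval(row) }
      out
    }
  }

  def main(args: Array[String]): Unit = {
    val kids = Seq(Child(r => r * 2), Child(r => r.toString))
    println(new Hoisted(kids).eval(21).mkString(", "))
  }
}
{code}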
[jira] [Updated] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35321: - Issue Type: Bug (was: Improvement) > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the megastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads the Hive permanent function directly from HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339946#comment-17339946 ] Chao Sun commented on SPARK-35321: -- [~yumwang] I'm thinking of using [Hive#getWithFastCheck|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L389] for this purpose, which allows us to set the flag to false. The fast check flag also offers a way to compare {{HiveConf}} faster when the conf rarely changes. > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the megastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads the Hive permanent function directly from HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339906#comment-17339906 ] Chao Sun commented on SPARK-35321: -- [~xkrogen] yes that can help to solve the issue, but users need to specify both {{spark.sql.hive.metastore.version}} and {{spark.sql.hive.metastore.jars}}. The latter is not so easy to setup: the {{maven}} option usually takes a very long time to download all the jars, while the {{path}} option require users to download all the relevant Hive jars with the specific version and it's tedious. I think this specific issue is worth fixing in Spark itself regardless since it doesn't really need to load all the permanent functions when starting up Hive client from what I can see. The process could also be pretty expensive if there are many UDFs registered in HMS. > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the megastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads the Hive permanent function directly from HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
Chao Sun created SPARK-35321: Summary: Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing Key: SPARK-35321 URL: https://issues.apache.org/jira/browse/SPARK-35321 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1, 3.0.2, 3.2.0 Reporter: Chao Sun https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This is called when creating a new {{Hive}} object: {code} private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } {code} {{registerAllFunctionsOnce }} will reload all the permanent functions by calling the {{get_all_functions}} API from the megastore. In Spark, we always pass {{doRegisterAllFns}} as true, and this will cause failure: {code} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) {code} It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} since it loads the Hive permanent function directly from HMS API. The Hive {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35321: - Description: https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This is called when creating a new {{Hive}} object: {code} private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } {code} {{registerAllFunctionsOnce}} will reload all the permanent functions by calling the {{get_all_functions}} API from the megastore. In Spark, we always pass {{doRegisterAllFns}} as true, and this will cause failure: {code} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) {code} It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} since it loads the Hive permanent function directly from HMS API. The Hive {{FunctionRegistry}} is only used for loading Hive built-in functions. was: https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. This is called when creating a new {{Hive}} object: {code} private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } {code} {{registerAllFunctionsOnce }} will reload all the permanent functions by calling the {{get_all_functions}} API from the megastore. In Spark, we always pass {{doRegisterAllFns}} as true, and this will cause failure: {code} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) {code} It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} since it loads the Hive permanent function directly from HMS API. The Hive {{FunctionRegistry}} is only used for loading Hive built-in functions. 
> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the megastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.relo
[jira] [Updated] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT
[ https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35315: - Priority: Minor (was: Major) > Keep benchmark result consistent between spark-submit and SBT > - > > Key: SPARK-35315 > URL: https://issues.apache.org/jira/browse/SPARK-35315 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Currently benchmark can be done in two ways: {{spark-submit}}, or SBT > command. However in the former Spark will miss some properties such as > {{IS_TESTING}}, which is useful to turn on/off some behavior like codegen. > Therefore, the result could differ with the two methods. In addition, the > benchmark GitHub workflow is using the {{spark-submit}} approach. > This propose to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is > always on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT
Chao Sun created SPARK-35315: Summary: Keep benchmark result consistent between spark-submit and SBT Key: SPARK-35315 URL: https://issues.apache.org/jira/browse/SPARK-35315 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.2.0 Reporter: Chao Sun Currently benchmarks can be run in two ways: {{spark-submit}}, or SBT command. However, in the former case Spark will miss some properties such as {{IS_TESTING}}, which is useful to turn on/off some behavior like codegen. Therefore, the results could differ between the two methods. In addition, the benchmark GitHub workflow is using the {{spark-submit}} approach. This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is always on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
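Editor's note: a minimal sketch of the SPARK-35315 idea above, assuming {{IS_TESTING}} is backed by the "spark.testing" system property (an assumption; the real change lives in {{BenchmarkBase}}). Forcing the property on inside the benchmark entry point makes a spark-submit run behave like the SBT run.
{code:scala}
object BenchmarkIsTestingSketch {
  // set "spark.testing" for the duration of the body, restoring the old value afterwards
  def withTestingEnabled[T](body: => T): T = {
    val key = "spark.testing" // assumed key behind IS_TESTING
    val previous = Option(System.getProperty(key))
    System.setProperty(key, "true")
    try body
    finally previous match {
      case Some(v) => System.setProperty(key, v)
      case None    => System.clearProperty(key)
    }
  }

  def main(args: Array[String]): Unit =
    withTestingEnabled {
      println(s"spark.testing = ${System.getProperty("spark.testing")}")
      // ... run the benchmark body here ...
    }
}
{code}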
[jira] [Updated] (SPARK-35281) StaticInvoke should not apply boxing if return type is primitive
[ https://issues.apache.org/jira/browse/SPARK-35281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35281: - Priority: Minor (was: Major) > StaticInvoke should not apply boxing if return type is primitive > > > Key: SPARK-35281 > URL: https://issues.apache.org/jira/browse/SPARK-35281 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Currently {{StaticInvoke}} will apply boxing even if the return type is > primitive. This seems unnecessary. In comparison, {{Invoke}} checks this and > skips the boxing if return type is primitive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35281) StaticInvoke should not apply boxing if return type is primitive
Chao Sun created SPARK-35281: Summary: StaticInvoke should not apply boxing if return type is primitive Key: SPARK-35281 URL: https://issues.apache.org/jira/browse/SPARK-35281 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently {{StaticInvoke}} will apply boxing even if the return type is primitive. This seems unnecessary. In comparison, {{Invoke}} checks this and skips the boxing if return type is primitive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
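Editor's note: a toy illustration of the cost SPARK-35281 above wants to avoid (not Catalyst codegen itself): a primitive return value stays unboxed, while a boxed return allocates a wrapper object that the caller typically unboxes again right away.
{code:scala}
object BoxingSketch {
  def primitiveAdd(a: Long, b: Long): Long = a + b // stays unboxed

  def boxedAdd(a: Long, b: Long): java.lang.Long =
    java.lang.Long.valueOf(a + b) // extra allocation, plus later unboxing at the call site

  def main(args: Array[String]): Unit = {
    println(primitiveAdd(1L, 2L))
    println(boxedAdd(1L, 2L))
  }
}
{code}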
[jira] [Created] (SPARK-35261) Support static invoke for stateless UDF
Chao Sun created SPARK-35261: Summary: Support static invoke for stateless UDF Key: SPARK-35261 URL: https://issues.apache.org/jira/browse/SPARK-35261 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun For UDFs that are stateless, we should allow users to define the "magic method" as a static Java method, which removes the extra cost of dynamic dispatch and gives better performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34981: - Parent: SPARK-35260 Issue Type: Sub-task (was: Improvement) > Implement V2 function resolution and evaluation > > > Key: SPARK-34981 > URL: https://issues.apache.org/jira/browse/SPARK-34981 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims > at implementing the function resolution (in analyzer) and evaluation by > wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35260) DataSourceV2 Function Catalog implementation
Chao Sun created SPARK-35260: Summary: DataSourceV2 Function Catalog implementation Key: SPARK-35260 URL: https://issues.apache.org/jira/browse/SPARK-35260 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun This tracks the implementation and follow-up work for V2 Function Catalog introduced in SPARK-27658 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35233) Switch from bintray to scala.jfrog.io for SBT download in branch 2.4 and 3.0
Chao Sun created SPARK-35233: Summary: Switch from bintray to scala.jfrog.io for SBT download in branch 2.4 and 3.0 Key: SPARK-35233 URL: https://issues.apache.org/jira/browse/SPARK-35233 Project: Spark Issue Type: Task Components: Build Affects Versions: 2.4.8, 3.1.2 Reporter: Chao Sun As bintray is going to be [deprecated|https://eed3si9n.com/bintray-to-jfrog-artifactory-migration-status-and-sbt-1.5.1], the download of {{sbt-launch.jar}} will fail. This proposes to migrate the URL from {{https://dl.bintray.com/typesafe}} to {{https://scala.jfrog.io/artifactory}} for branch-2.4 and branch-3.0, which are affected by this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35232) Nested column pruning should retain column metadata
Chao Sun created SPARK-35232: Summary: Nested column pruning should retain column metadata Key: SPARK-35232 URL: https://issues.apache.org/jira/browse/SPARK-35232 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun It seems we should retain column metadata when pruning nested columns. Otherwise the info will be lost and will affect things such as re-constructing CHAR/VARCHAR type (SPARK-33901). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
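Editor's note: a hedged sketch of the requirement in SPARK-35232 above, not Spark's schema-pruning code: when building the pruned struct, copy the kept {{StructField}} entries verbatim so {{field.metadata}} (for example the annotation used to reconstruct CHAR/VARCHAR types) is carried over. The metadata key below is illustrative.
{code:scala}
import org.apache.spark.sql.types._

object MetadataPreservingPruneSketch {
  // keeps name, type, nullability and, crucially, metadata for each retained field
  def prune(struct: StructType, keep: Set[String]): StructType =
    StructType(struct.fields.filter(f => keep.contains(f.name)))

  def main(args: Array[String]): Unit = {
    val charMeta = new MetadataBuilder().putString("rawType", "varchar(10)").build() // illustrative key
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true, charMeta),
      StructField("age", IntegerType)))
    val pruned = prune(schema, Set("name"))
    println(pruned("name").metadata) // the varchar annotation survives the pruning
  }
}
{code}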
[jira] [Created] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
Chao Sun created SPARK-35195: Summary: Move InMemoryTable etc to org.apache.spark.sql.connector.catalog Key: SPARK-35195 URL: https://issues.apache.org/jira/browse/SPARK-35195 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently test classes such as {{InMemoryTable}} reside in {{org.apache.spark.sql.connector}} rather than {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to match the interfaces they implement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35003) Improve performance for reading smallint in vectorized Parquet reader
Chao Sun created SPARK-35003: Summary: Improve performance for reading smallint in vectorized Parquet reader Key: SPARK-35003 URL: https://issues.apache.org/jira/browse/SPARK-35003 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently {{VectorizedRleValuesReader}} reads shorts in the following way: {code:java} for (int i = 0; i < n; i++) { c.putShort(rowId + i, (short)data.readInteger()); } {code} For PLAIN encoding {{readInteger}} is done via: {code:java} public final int readInteger() { return getBuffer(4).getInt(); } {code} which means it needs to repeatedly slice the buffer, which is more expensive than slicing it once as a big chunk and then reading the ints out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
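Editor's note: a standalone sketch of the SPARK-35003 idea above using plain java.nio (not the actual Java reader code): PLAIN-encoded int32 values are little-endian 4-byte ints, so instead of taking a fresh 4-byte view per value, take one view over all n values and read them out by index.
{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

object ChunkedShortReadSketch {
  // one slice per value, mirroring getBuffer(4) inside the loop
  def readPerValue(data: ByteBuffer, n: Int, out: Array[Short]): Unit = {
    var i = 0
    while (i < n) {
      val view = data.slice().order(ByteOrder.LITTLE_ENDIAN) // new buffer object per value
      out(i) = view.getInt(0).toShort
      data.position(data.position() + 4)
      i += 1
    }
  }

  // one slice for the whole chunk, then plain index arithmetic
  def readChunked(data: ByteBuffer, n: Int, out: Array[Short]): Unit = {
    val chunk = data.slice().order(ByteOrder.LITTLE_ENDIAN)
    var i = 0
    while (i < n) {
      out(i) = chunk.getInt(i * 4).toShort
      i += 1
    }
    data.position(data.position() + n * 4)
  }

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    Seq(1, 2, 3).foreach(v => buf.putInt(v))
    buf.flip()
    val out = new Array[Short](3)
    readChunked(buf, 3, out)
    println(out.mkString(", ")) // 1, 2, 3
  }
}
{code}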
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316714#comment-17316714 ] Chao Sun commented on SPARK-34780: -- Hi [~mikechen] (and sorry for the late reply again), thanks for providing another very useful code snippet! I'm not sure if this qualifies as correctness issue though since it is (to me) more like different interpretations of malformed columns in CSV? My previous statement about {{SessionState}} is incorrect. It seems the conf in {{SessionState}} is always the most up-to-date one. The only solution I can think of to solve this issue is to take conf into account when checking equality of {{HadoopFsRelation}} (and potentially others), which means we'd need to define equality for {{SQLConf}}.. > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
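Editor's note: a rough standalone sketch of the direction mentioned in the SPARK-34780 comment above (all names hypothetical): restrict "equality of SQLConf" to the entries that actually influence planning, and fold that restricted view into the relation's equality/cache key so a plan cached under one compression factor is not reused under another.
{code:scala}
object ConfAwareEqualitySketch {
  // hypothetical allow-list of conf keys that should participate in equality
  private val planningKeys = Seq("spark.sql.sources.fileCompressionFactor")

  def planningView(conf: Map[String, String]): Map[String, String] =
    planningKeys.flatMap(k => conf.get(k).map(k -> _)).toMap

  final case class RelationKey(location: String, schema: Seq[String], planningConf: Map[String, String])

  def main(args: Array[String]): Unit = {
    val base  = Map("spark.sql.sources.fileCompressionFactor" -> "1", "spark.app.name" -> "a")
    val other = base.updated("spark.sql.sources.fileCompressionFactor", "10")
    val k1 = RelationKey("/table", Seq("key"), planningView(base))
    val k2 = RelationKey("/table", Seq("key"), planningView(other))
    println(k1 == k2) // false: the cached plan would no longer be reused across differing confs
  }
}
{code}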
[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316477#comment-17316477 ] Chao Sun commented on SPARK-34981: -- Will submit a PR soon. > Implement V2 function resolution and evaluation > > > Key: SPARK-34981 > URL: https://issues.apache.org/jira/browse/SPARK-34981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims > at implementing the function resolution (in analyzer) and evaluation by > wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34981) Implement V2 function resolution and evaluation
Chao Sun created SPARK-34981: Summary: Implement V2 function resolution and evaluation Key: SPARK-34981 URL: https://issues.apache.org/jira/browse/SPARK-34981 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims at implementing the function resolution (in analyzer) and evaluation by wrapping them into corresponding expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34973) Cleanup unused fields and methods in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-34973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34973: - Priority: Minor (was: Major) > Cleanup unused fields and methods in vectorized Parquet reader > -- > > Key: SPARK-34973 > URL: https://issues.apache.org/jira/browse/SPARK-34973 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > There are some legacy fields and methods in vectorized Parquet reader which > are no longer used. It's better to clean them up to make the code easier to > maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34973) Cleanup unused fields and methods in vectorized Parquet reader
Chao Sun created SPARK-34973: Summary: Cleanup unused fields and methods in vectorized Parquet reader Key: SPARK-34973 URL: https://issues.apache.org/jira/browse/SPARK-34973 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun There are some legacy fields and methods in vectorized Parquet reader which are no longer used. It's better to clean them up to make the code easier to maintain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34947) Streaming write to a V2 table should invalidate its associated cache
Chao Sun created SPARK-34947: Summary: Streaming write to a V2 table should invalidate its associated cache Key: SPARK-34947 URL: https://issues.apache.org/jira/browse/SPARK-34947 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.1 Reporter: Chao Sun A DSv2 table that supports {{STREAMING_WRITE}} can be written by a streaming job. However currently Spark doesn't invalidate its associated cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34945) Fix Javadoc for catalyst module
Chao Sun created SPARK-34945: Summary: Fix Javadoc for catalyst module Key: SPARK-34945 URL: https://issues.apache.org/jira/browse/SPARK-34945 Project: Spark Issue Type: Improvement Components: docs, Documentation Affects Versions: 3.2.0 Reporter: Chao Sun Inside the catalyst module there are many Java classes, especially those in DSv2, that are not using the proper Javadoc format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308854#comment-17308854 ] Chao Sun commented on SPARK-34780: -- [~mikechen], yes you're right. I'm not sure if this is a big concern though, since it just means the plan fragment for the cache is executed with the stale conf. I guess as long as there is no correctness issue (which I'd be surprised to see if there's any), it should be fine? It seems a bit tricky to fix the issue, since the {{SparkSession}} is leaked to many places. I guess one way is to follow the idea of SPARK-33389 and change {{SessionState}} to always use the active conf. > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308262#comment-17308262 ] Chao Sun commented on SPARK-34780: -- Sorry for the late reply [~mikechen]! There's something I still not quite clear: when the cache is retrieved, a {{InMemoryRelation}} will be used to replace the plan fragment that is matched. Therefore, how can the old stale conf still be used in places like {{DataSourceScanExec}}? > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30497) migrate DESCRIBE TABLE to the new framework
[ https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308067#comment-17308067 ] Chao Sun commented on SPARK-30497: -- [~cloud_fan] this is resolved right? > migrate DESCRIBE TABLE to the new framework > --- > > Key: SPARK-30497 > URL: https://issues.apache.org/jira/browse/SPARK-30497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305109#comment-17305109 ] Chao Sun commented on SPARK-34780: -- Thanks for the reporting [~mikechen], the test case you provided is very useful. I'm not sure, though, how severe is the issue since it only affects {{computeStats}}, and when the cache is actually materialized (e.g., via {{df2.count()}} after {{df2.cache()}}), the value from {{computeStats}} will be different anyways. Could you give more details? > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32703: - Description: Currently {{SpecificParquetRecordReaderBase}} uses deprecated Parquet APIs in a few places, such as {{readFooter}}, {{ParquetInputSplit}}, the deprecated ctor for {{ParquetFileReader}}, {{filterRowGroups}}, etc. These are going to be removed in future Parquet versions, so we should move to the new APIs for better maintainability. (was: Parquet vectorized reader still uses the old API for {{filterRowGroups}} and only filters on statistics. It should switch to the new API and do dictionary filtering as well.) > Replace deprecated API calls from SpecificParquetRecordReaderBase > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Currently {{SpecificParquetRecordReaderBase}} uses deprecated Parquet APIs in a > few places, such as {{readFooter}}, {{ParquetInputSplit}}, the > deprecated ctor for {{ParquetFileReader}}, {{filterRowGroups}}, etc. These > are going to be removed in future Parquet versions, so we should > move to the new APIs for better maintainability. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
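As a rough sketch of the kind of migration involved (this is not the actual {{SpecificParquetRecordReaderBase}} code, and the path below is made up), the deprecated static {{readFooter}} call can be replaced with the {{InputFile}}-based reader API:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val path = new Path("/tmp/example.parquet")  // made-up path for illustration

// Deprecated style:
//   val footer = ParquetFileReader.readFooter(conf, path)

// Newer style: open the file via InputFile and ask the reader for the
// footer and row groups instead of using the static helpers.
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))
try {
  val footer = reader.getFooter        // ParquetMetadata
  val rowGroups = reader.getRowGroups  // row groups after any configured filtering
  println(s"row groups: ${rowGroups.size()}")
} finally {
  reader.close()
}
{code}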
[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32703: - Summary: Replace deprecated API calls from SpecificParquetRecordReaderBase (was: Enable dictionary filtering for Parquet vectorized reader) > Replace deprecated API calls from SpecificParquetRecordReaderBase > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290707#comment-17290707 ] Chao Sun commented on SPARK-33212: -- Yes. I think the only class Spark needs from this jar is {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together with the other two classes it depends on from the same package, has no Guava dependency other than {{VisibleForTesting}}. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM: I was able to reproduce the error in my local environment and found a potential fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. was (Author: csun): I was able to reproduce the error in my local environment and found a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun commented on SPARK-33212: -- I was able to reproduce the error in my local environment and found a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290127#comment-17290127 ] Chao Sun commented on SPARK-33212: -- Thanks again [~ouyangxc.zte]. {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included in the {{hadoop-client}} jars since it is a server-side class and ideally should not be exposed to client applications such as Spark. [~dongjoon] Let me see how we can fix this either in Spark or Hadoop. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652 ] Chao Sun commented on SPARK-33212: -- Thanks for the details [~ouyangxc.zte]! {quote} Get AMIpFilter ClassNotFoundException , because there is no 'hadoop-client-minicluster.jar' in classpath {quote} This is interesting: the {{hadoop-client-minicluster.jar}} should only be used in tests, so I'm curious why it is needed here. Could you share the stack trace for the {{ClassNotFoundException}}? {quote} 2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter {quote} Could you also share the stack trace for this exception? And to confirm, you are using {{client}} as the deploy mode, is that correct? I'll try to reproduce this in my local environment. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200 ] Chao Sun commented on SPARK-33212: -- Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as error messages, stack traces, and steps to reproduce the issue? > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34419) Move PartitionTransforms from java to scala directory
Chao Sun created SPARK-34419: Summary: Move PartitionTransforms from java to scala directory Key: SPARK-34419 URL: https://issues.apache.org/jira/browse/SPARK-34419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun {{PartitionTransforms}} is currently under {{sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions}}. It should be under {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views
Chao Sun created SPARK-34347: Summary: CatalogImpl.uncacheTable should invalidate in cascade for temp views Key: SPARK-34347 URL: https://issues.apache.org/jira/browse/SPARK-34347 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, {{CatalogImpl.uncacheTable}} should invalidate caches for temp views in cascade. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
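A rough illustration of the cascade behavior in question (view names are made up; this is only a sketch of the scenario, not a test from any PR):
{code:scala}
// Build a temp view, cache it, and cache another temp view defined on top of it.
spark.range(10).toDF("id").createOrReplaceTempView("t1")
spark.sql("CACHE TABLE t1")
spark.sql("CREATE TEMPORARY VIEW v1 AS SELECT id FROM t1 WHERE id > 5")
spark.sql("CACHE TABLE v1")

// Uncaching the underlying temp view through the Catalog API should also
// invalidate the dependent cache entry for v1 (the cascade this issue is about).
spark.catalog.uncacheTable("t1")
println(spark.catalog.isCached("v1"))  // expected: false once the cascade is in place
{code}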
[jira] [Resolved] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-34108. -- Resolution: Duplicate > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a temporary or permanent view doesn't work in certain > cases. For instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Description: Currently, caching a temporary or permanent view doesn't work in certain cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. On the other hand: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. was: Currently, caching a permanent view doesn't work in some cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. However, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a temporary or permanent view doesn't work in certain > cases. For instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
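For anyone reproducing this, one way to check whether the last SELECT actually picks up the cache (a sketch only, run after the statements in the second example above):
{code:scala}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// If the cached view is picked up, the optimized plan contains an InMemoryRelation.
val hitsCache = spark.sql("SELECT key FROM t ORDER BY key")
  .queryExecution.optimizedPlan
  .collect { case r: InMemoryRelation => r }
  .nonEmpty

// Currently prints false for the ORDER BY example, because the extra Project
// added around the view breaks the canonicalized-plan comparison.
println(hitsCache)
{code}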
[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Summary: Cache lookup doesn't work in certain cases (was: Caching with permanent view doesn't work in certain cases) > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a permanent view doesn't work in certain cases. For > instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing
Chao Sun created SPARK-34271: Summary: Use majorMinorPatchVersion for Hive version parsing Key: SPARK-34271 URL: https://issues.apache.org/jira/browse/SPARK-34271 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions. Therefore, whenever we upgrade the Hive version we'd have to remember to update the method. It would be better if we just check the major & minor versions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
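The rough idea, sketched outside of the actual {{IsolatedClientLoader}} code (the helper name here is made up):
{code:scala}
// Match on major.minor only, so new Hive patch releases need no code change.
def majorMinor(version: String): Option[(Int, Int)] = {
  val pattern = """^(\d+)\.(\d+)(\.\d+.*)?$""".r
  pattern.findFirstMatchIn(version).map(m => (m.group(1).toInt, m.group(2).toInt))
}

assert(majorMinor("2.3.8") == Some((2, 3)))
assert(majorMinor("2.3.9") == Some((2, 3)))  // a new patch version maps to the same client
{code}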
[jira] [Commented] (SPARK-27589) Spark file source V2
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273161#comment-17273161 ] Chao Sun commented on SPARK-27589: -- [~xkrogen] FWIW I'm working on a POC for SPARK-32935 at the moment. There is also a design doc in progress. Hopefully we'll be able to share it soon. cc [~rdblue] too. > Spark file source V2 > > > Key: SPARK-27589 > URL: https://issues.apache.org/jira/browse/SPARK-27589 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Re-implement file sources with data source V2 API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34052) A cached view should become invalid after a table is dropped
[ https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272333#comment-17272333 ] Chao Sun commented on SPARK-34052: -- [~hyukjin.kwon] [~cloud_fan] do you think we should include this in 3.1.1? Since we've changed how temp views work in SPARK-33142, it may be better to add this too to keep the behavior consistent. > A cached view should become invalid after a table is dropped > > > Key: SPARK-34052 > URL: https://issues.apache.org/jira/browse/SPARK-34052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > It seems a view doesn't become invalid after a DSv2 table is dropped or > replaced. This is different from V1 and may cause a correctness issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268063#comment-17268063 ] Chao Sun commented on SPARK-33507: -- [~aokolnychyi], could you elaborate on the question? Currently Spark doesn't support caching streaming tables yet. > Improve and fix cache behavior in v1 and v2 > --- > > Key: SPARK-33507 > URL: https://issues.apache.org/jira/browse/SPARK-33507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Critical > > This is an umbrella JIRA to track fixes & improvements for caching behavior > in Spark datasource v1 and v2, which includes: > - fix existing cache behavior in v1 and v2. > - fix inconsistent cache behavior between v1 and v2 > - implement missing features in v2 to align with those in v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33507) Improve and fix cache behavior in v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264608#comment-17264608 ] Chao Sun edited comment on SPARK-33507 at 1/14/21, 5:23 AM: Thanks [~hyukjin.kwon]. From my side, there is no regression, although I feel SPARK-34052 is a bit important since it concerns correctness. I'm working on a fix but got delayed by a few other issues found along the way :(. The issue has been there for a long time though, so I'm fine moving this to the next release. was (Author: csun): Thanks [~hyukjin.kwon]. From my side, there is no regression, although I feel SPARK-34052 is a bit important since it concerns correctness. I'm working on a fix but got delayed by a few other issues found during the process :( > Improve and fix cache behavior in v1 and v2 > --- > > Key: SPARK-33507 > URL: https://issues.apache.org/jira/browse/SPARK-33507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Critical > > This is an umbrella JIRA to track fixes & improvements for caching behavior > in Spark datasource v1 and v2, which includes: > - fix existing cache behavior in v1 and v2. > - fix inconsistent cache behavior between v1 and v2 > - implement missing features in v2 to align with those in v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264608#comment-17264608 ] Chao Sun commented on SPARK-33507: -- Thanks [~hyukjin.kwon]. From my side, there is no regression, although I feel SPARK-34052 is a bit important since it concerns correctness. I'm working on a fix but got delayed by a few other issues found during the process :( > Improve and fix cache behavior in v1 and v2 > --- > > Key: SPARK-33507 > URL: https://issues.apache.org/jira/browse/SPARK-33507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Critical > > This is an umbrella JIRA to track fixes & improvements for caching behavior > in Spark datasource v1 and v2, which includes: > - fix existing cache behavior in v1 and v2. > - fix inconsistent cache behavior between v1 and v2 > - implement missing features in v2 to align with those in v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34108) Caching with permanent view doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Summary: Caching with permanent view doesn't work in certain cases (was: Caching doesn't work completely with permanent view) > Caching with permanent view doesn't work in certain cases > - > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a permanent view doesn't work in certain cases. For > instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34108) Caching doesn't work completely with permanent view
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Description: Currently, caching a permanent view doesn't work in certain cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. On the other hand: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. was: Currently, caching a permanent view doesn't work in some cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. However, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. > Caching doesn't work completely with permanent view > --- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a permanent view doesn't work in certain cases. For > instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org