[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415703#comment-17415703 ]

DB Tsai commented on SPARK-36696:
---------------------------------

This issue is addressed by https://issues.apache.org/jira/browse/SPARK-34542. Can we close this JIRA?

> spark.read.parquet loads empty dataset
> --------------------------------------
>
>                 Key: SPARK-36696
>                 URL: https://issues.apache.org/jira/browse/SPARK-36696
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Priority: Blocker
>         Attachments: example.parquet
>
> Here's a parquet file that Spark 3.2/master can't read properly.
> The file was stored by pandas and contains 3650 rows, but Spark 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
> - [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
> - [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414721#comment-17414721 ]

Micah Kornfield commented on SPARK-36696:
-----------------------------------------

What [~gershinsky] wrote seems to make sense from my reading of the code. I think the issue here is PARQUET-2089.
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412599#comment-17412599 ]

Gidon Gershinsky commented on SPARK-36696:
------------------------------------------

{quote}why column chunk file offset = dictionary/data page offset + compressed size of the column chunk?{quote}
A Java (parquet-mr) specific comment: this version mostly uses the offsets in the ColumnMetaData structure. Recently, it started to use the offset in the RowGroup structure. But it doesn't use the offset in the ColumnChunk (AFAIK; at least my IJ couldn't find its usage :)
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412595#comment-17412595 ]

Gidon Gershinsky commented on SPARK-36696:
------------------------------------------

The [fix|https://github.com/apache/parquet-mr/pull/925] for PARQUET-2078 solves this problem. But the Arrow folks need to fix the `RowGroup.offset` computation, since it might affect some of the encrypted files.
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412167#comment-17412167 ]

Chao Sun commented on SPARK-36696:
----------------------------------

[This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331] looks suspicious: why is the column chunk file offset = dictionary/data page offset + compressed size of the column chunk?
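The computation questioned in this comment can be sketched as follows (a hypothetical illustration; the function names are not Arrow's actual identifiers). Adding the chunk's compressed size to the offset of its first page yields the *end* of the chunk, whereas a file offset is normally expected to be its start:

```python
# Illustrative sketch of the suspicious line in Arrow's metadata.cc.
# first_page_offset: the dictionary page offset when a dictionary page
# exists, else the data page offset.

def arrow_file_offset(first_page_offset, total_compressed_size):
    # What the linked Arrow code appears to write: start + size,
    # i.e. one byte past the end of the column chunk.
    return first_page_offset + total_compressed_size

def expected_file_offset(first_page_offset):
    # What a reader expects: the offset of the chunk's first page.
    return first_page_offset

# A chunk whose first page starts at byte 100 and spans 500 bytes:
print(arrow_file_offset(100, 500))     # 600 (end of chunk)
print(expected_file_offset(100))       # 100 (start of chunk)
```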
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412164#comment-17412164 ]

Chao Sun commented on SPARK-36696:
----------------------------------

This looks like the same issue as PARQUET-2078. The file offset for the first row group is set to 31173, which causes {{filterFileMetaDataByMidpoint}} to filter out the only row group (the range filter is [0, 37968], while startIndex is 31173 and the total size is 35820). It seems there is a bug in Apache Arrow that writes an incorrect file offset. cc [~gershinsky] to see if you know any info there.
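The filtering described in this comment can be sketched with its numbers (a simplified model of parquet-mr's midpoint check, not the actual {{filterFileMetaDataByMidpoint}} code): a row group is kept only when its midpoint falls inside the requested byte range, so the bogus start offset pushes the midpoint past the end of the file.

```python
# Simplified model of parquet-mr's midpoint-based row-group filtering:
# keep a row group only if its midpoint lies inside [range_start, range_end].

def keeps_row_group(start_index, total_size, range_start, range_end):
    midpoint = start_index + total_size // 2
    return range_start <= midpoint <= range_end

# Numbers from this comment: the (incorrect) start offset is 31173,
# the row group's total size is 35820, and the requested range covering
# the whole file is [0, 37968].
print(keeps_row_group(31173, 35820, 0, 37968))  # midpoint 49083 > 37968 -> False

# With a plausible correct start offset (the first page begins just
# after the 4-byte "PAR1" magic; an assumption for illustration),
# the row group would be kept:
print(keeps_row_group(4, 35820, 0, 37968))      # midpoint 17914 -> True
```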