[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-15 Thread DB Tsai (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415703#comment-17415703 ]

DB Tsai commented on SPARK-36696:
---------------------------------

This issue is addressed by https://issues.apache.org/jira/browse/SPARK-34542. 
Can we close this JIRA?

> spark.read.parquet loads empty dataset
> --------------------------------------
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file that Spark 3.2/master can't read properly.
> The file was written by pandas and contains 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess this is caused by the Parquet 1.12.0 upgrade.
> When I reverted the two Parquet 1.12.0-related commits from branch-3.2:
>  - [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it read the data successfully.
> We need to add a workaround or revert the commits.






[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-13 Thread Micah Kornfield (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414721#comment-17414721 ]

Micah Kornfield commented on SPARK-36696:
-----------------------------------------

What [~gershinsky] wrote seems to make sense from my reading of the code. I 
think the issue here is PARQUET-2089.




[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-09 Thread Gidon Gershinsky (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412599#comment-17412599 ]

Gidon Gershinsky commented on SPARK-36696:
------------------------------------------

{quote}why column chunk file offset = dictionary/data page offset + compressed size 
of the column chunk?{quote}

A Java (parquet-mr)-specific comment: this version mostly uses the offsets in 
the ColumnMetaData structure. Recently, it started to use the offset in the 
RowGroup structure. But it doesn't use the offset in the ColumnChunk (AFAIK; at 
least my IntelliJ couldn't find any usage of it :)
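
To make this concrete, here is a minimal sketch (mine, not from parquet-mr or this 
thread) of how a reader could derive a row group's true start from the first column 
chunk's ColumnMetaData page offsets instead of trusting the file offset fields; the 
attribute names follow pyarrow's ColumnChunkMetaData and are illustrative only.

{code:python}
# Illustrative sketch only: a column chunk begins at its dictionary page when
# one exists, otherwise at its first data page, so a row group's start can be
# recovered from the first chunk's ColumnMetaData instead of from file_offset.
def row_group_start(first_column_chunk_meta):
    dict_off = first_column_chunk_meta.dictionary_page_offset
    data_off = first_column_chunk_meta.data_page_offset
    return dict_off if dict_off is not None else data_off
{code}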




[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-09 Thread Gidon Gershinsky (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412595#comment-17412595 ]

Gidon Gershinsky commented on SPARK-36696:
------------------------------------------

The [fix|https://github.com/apache/parquet-mr/pull/925] for PARQUET-2078 solves 
this problem. However, the Arrow folks still need to fix the {{RowGroup.offset}} 
computation, since it might affect some encrypted files.




[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Chao Sun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412167#comment-17412167 ]

Chao Sun commented on SPARK-36696:
----------------------------------

[This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331]
 looks suspicious: why column chunk file offset = dictionary/data page offset + 
compressed size of the column chunk?
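
For reference, the suspicious value can be seen in the attached file's footer with 
pyarrow (a sketch, assuming pyarrow is installed and the path is adjusted):

{code:python}
import pyarrow.parquet as pq

meta = pq.ParquetFile('/path/to/example.parquet').metadata
col = meta.row_group(0).column(0)  # first column chunk of the only row group

print(col.dictionary_page_offset)  # where the chunk actually starts (None if no dictionary page)
print(col.data_page_offset)        # offset of the first data page
print(col.total_compressed_size)   # compressed size of the chunk
print(col.file_offset)             # page offset + compressed size, per the linked metadata.cc line
{code}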




[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Chao Sun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164 ]

Chao Sun commented on SPARK-36696:
----------------------------------

This looks like the same issue as PARQUET-2078. The file offset for the first 
row group is set to 31173, which causes {{filterFileMetaDataByMidpoint}} to 
filter out the only row group (the range filter is [0, 37968], while startIndex 
is 31173 and the total size is 35820).

It seems there is a bug in Apache Arrow that writes an incorrect file offset. 
cc [~gershinsky] in case you know anything about this.
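
To illustrate with the numbers above, here is a rough sketch of the midpoint check 
that {{filterFileMetaDataByMidpoint}} effectively performs (not the actual 
parquet-mr code; the "correct" start offset of 4 below is only an assumption for 
contrast):

{code:python}
# A row group is kept only if its midpoint falls inside the requested split range.
def keeps_row_group(start_index, total_size, range_start, range_end):
    midpoint = start_index + total_size // 2
    return range_start <= midpoint < range_end

# With the bogus file offset written into the attached file's footer:
print(keeps_row_group(31173, 35820, 0, 37968))  # False -> the only row group is skipped, count() == 0

# With a start offset just after the 4-byte "PAR1" magic (assumed correct value):
print(keeps_row_group(4, 35820, 0, 37968))      # True -> the row group would be read
{code}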
