[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425959#comment-17425959
 ] 

Micah Kornfield commented on SPARK-34276:
-

Sorry for the late reply.  PARQUET-2089 has been a long-standing bug in the C++ 
implementation where we were setting file_offset to the beginning of the 
column_chunk metadata and not the actual data page.  It's not clear to me if 
this was a problem in practice before parquet-mr 1.12.  [~gershinsky] Would the 
fix in PARQUET-2078 make parquet-mr resilient to this bug?
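
For illustration only, one way a reader could avoid depending on file_offset is to derive the start of a column chunk from the page offsets stored in the column metadata. The sketch below is a minimal Python example under stated assumptions: `chunk_start_offset` is a hypothetical helper, its parameters mirror the `data_page_offset` and `dictionary_page_offset` fields of the Parquet column metadata, and treating a zero or missing dictionary offset as "no dictionary page" is an assumption. It is not the code of parquet-mr or of the C++ implementation, and it does not answer whether PARQUET-2078 covers this case.

```python
from typing import Optional


def chunk_start_offset(data_page_offset: int,
                       dictionary_page_offset: Optional[int] = None) -> int:
    """Return the byte offset where a column chunk's pages begin.

    Hypothetical helper: it ignores file_offset entirely and relies only
    on the page offsets recorded in the column metadata.
    """
    # Assumption: a dictionary page, when present, is written before the
    # first data page, so a valid dictionary offset is positive and smaller
    # than the data page offset.
    if dictionary_page_offset is not None and 0 < dictionary_page_offset < data_page_offset:
        return dictionary_page_offset
    # Otherwise the chunk starts at its first data page.
    return data_page_offset
```

With this approach a wrong file_offset has no effect on where the reader starts, because the value is never consulted.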

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
>                 Key: SPARK-34276
>                 URL: https://issues.apache.org/jira/browse/SPARK-34276
>             Project: Spark
>          Issue Type: Task
>          Components: Build, SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Chao Sun
>            Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-15 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415725#comment-17415725
 ] 

Takuya Ueshin commented on SPARK-34276:
---

SPARK-36696 was fixed by upgrading parquet to 1.12.1.

Btw, another issue was raised in the ticket: PARQUET-2089. cc [~gershinsky] 
[~emkornfield]

I'm not sure whether the issue affects anything or not, though.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-08 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412183#comment-17412183
 ] 

Takuya Ueshin commented on SPARK-34276:
---

SPARK-36696 seems to be the actual issue caused by PARQUET-2078.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-07 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411391#comment-17411391
 ] 

Gengliang Wang commented on SPARK-34276:


[~nemon] [~gszadovszky] [~csun] the PR 
https://github.com/apache/parquet-mr/pull/925 is still open. If we can't have a 
new Parquet release within one week, I am afraid we will have to consider 
reverting Parquet 1.12 in Spark 3.2.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-01 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408324#comment-17408324
 ] 

Chao Sun commented on SPARK-34276:
--

I studied the code, and it seems this will only affect Spark when 
{{spark.sql.hive.convertMetastoreParquet}} is set to false, as [~nemon] pointed 
out above. By default Spark uses {{filterFileMetaDataByMidpoint}} (see 
[here|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226]),
 which is not impacted much by this bug. In the worst case it could cause 
imbalance when assigning Parquet row groups to Spark tasks, but nothing like a 
read error or data loss.
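
The midpoint-based assignment described above can be sketched roughly as follows. This is a simplified Python illustration, not parquet-mr's actual code: the `RowGroup` class, the `assign_by_midpoint` helper, and all byte offsets are hypothetical. The idea is that a row group goes to the one split whose byte range contains the row group's midpoint, so a wrong start offset can move a row group to a different split but does not cause it to be read twice or dropped, as long as the midpoint still falls inside some split.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Split = Tuple[int, int]  # half-open byte range [start, end) of an input split


@dataclass(frozen=True)
class RowGroup:
    start: int  # byte offset where the row group is believed to start
    size: int   # total compressed size in bytes


def assign_by_midpoint(row_groups: List[RowGroup],
                       splits: List[Split]) -> Dict[Split, List[int]]:
    """Map each split to the indexes of the row groups whose midpoint it contains."""
    assignment: Dict[Split, List[int]] = {split: [] for split in splits}
    for index, row_group in enumerate(row_groups):
        mid = row_group.start + row_group.size // 2
        for lo, hi in splits:
            if lo <= mid < hi:
                assignment[(lo, hi)].append(index)
                break  # a midpoint belongs to at most one split
    return assignment


splits = [(0, 100), (100, 200)]

# Correct offsets: one row group per split.
correct = [RowGroup(start=4, size=90), RowGroup(start=94, size=90)]
balanced = assign_by_midpoint(correct, splits)

# Hypothetical bad offset for the second row group: it lands in the first split.
wrong = [RowGroup(start=4, size=90), RowGroup(start=4, size=90)]
skewed = assign_by_midpoint(wrong, splits)
```

In this toy example `balanced` places one row group in each split, while `skewed` places both in the first split and leaves the second empty. Every row group is still assigned exactly once, which is the imbalance-without-data-loss outcome the comment describes.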







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-01 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407960#comment-17407960
 ] 

Gabor Szadovszky commented on SPARK-34276:
--

[~csun], any application using parquet-mr 1.12.0 is impacted by PARQUET-2078.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-31 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407737#comment-17407737
 ] 

Nemon Lou commented on SPARK-34276:
---

[~csun] yes, the same as PARQUET-2078.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-31 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407554#comment-17407554
 ] 

Chao Sun commented on SPARK-34276:
--

[~smilegator] yea seems like Spark will be affected. cc [~gszadovszky] to 
confirm. [~nemon] is the issue you mentioned the same as PARQUET-2078? 







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-31 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407203#comment-17407203
 ] 

Nemon Lou commented on SPARK-34276:
---

Spark also fails to read Parquet files when 
spark.sql.hive.convertMetastoreParquet=false is set.

This setting is true by default.







[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-27 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405878#comment-17405878
 ] 

Xiao Li commented on SPARK-34276:
-

https://issues.apache.org/jira/browse/PARQUET-2078 Do we have this problem?

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert 
> Parquet. At the same time, we should encourage the whole community to do the 
> compatibility and performance tests for their production workloads, including 
> both read and write code paths.
> More details: 
> https://github.com/apache/spark/pull/26804#issuecomment-768790620


