[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425959#comment-17425959 ]

Micah Kornfield commented on SPARK-34276:
-----------------------------------------

Sorry for the late reply. PARQUET-2089 has been a long-standing bug in the C++ implementation where we were setting file_offset to the beginning of the column_chunk metadata and not the actual data page. It's not clear to me whether this was a problem in practice before parquet-mr 1.12.

[~gershinsky] Would the fix in PARQUET-2078 make parquet-mr resilient to this bug?

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> ------------------------------------------------------------------
>
>                 Key: SPARK-34276
>                 URL: https://issues.apache.org/jira/browse/SPARK-34276
>             Project: Spark
>          Issue Type: Task
>          Components: Build, SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Assignee: Chao Sun
>            Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should
> upgrade/revert Parquet. At the same time, we should encourage the whole
> community to do the compatibility and performance tests for their production
> workloads, including both read and write code paths.
> More details:
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415725#comment-17415725 ]

Takuya Ueshin commented on SPARK-34276:
---------------------------------------

SPARK-36696 was fixed by upgrading Parquet to 1.12.1.

Btw, another issue was raised in that ticket: PARQUET-2089.

cc [~gershinsky] [~emkornfield]

I'm not sure whether that issue affects anything, though.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412183#comment-17412183 ]

Takuya Ueshin commented on SPARK-34276:
---------------------------------------

SPARK-36696 seems to be the actual issue caused by PARQUET-2078.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411391#comment-17411391 ]

Gengliang Wang commented on SPARK-34276:
----------------------------------------

[~nemon] [~gszadovszky] [~csun] the PR https://github.com/apache/parquet-mr/pull/925 is still open. If we can't have a new Parquet release within one week, I am afraid we will have to consider reverting Parquet 1.12 in Spark 3.2.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408324#comment-17408324 ]

Chao Sun commented on SPARK-34276:
----------------------------------

I studied the code, and it seems this will only affect Spark when {{spark.sql.hive.convertMetastoreParquet}} is set to false, as [~nemon] pointed out above. By default Spark uses {{filterFileMetaDataByMidpoint}} (see [here|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226]), which is not impacted much by this bug. In the worst case it could cause imbalance when assigning Parquet row groups to Spark tasks, but nothing like a read error or data loss.
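
To illustrate the midpoint argument in the comment above, here is a minimal Python sketch. It is not parquet-mr's code: the {{RowGroup}} fields, the helper names, and all offsets are made up for illustration, and it assumes the simplified rule that a row group belongs to the split whose byte range contains the row group's midpoint. Under that assumption, a wrongly reported start offset can move a row group to a different split (imbalance), but each row group is still picked up by exactly one split.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Split = Tuple[int, int]  # half-open byte range [start, end)


@dataclass(frozen=True)
class RowGroup:
    name: str
    start: int  # reported starting byte offset (hypothetical field)
    size: int   # total compressed size in bytes (hypothetical field)


def midpoint(rg: RowGroup) -> int:
    """Byte position halfway through the row group, based on reported values."""
    return rg.start + rg.size // 2


def filter_by_midpoint(row_groups: List[RowGroup], split: Split) -> List[RowGroup]:
    """Keep only the row groups whose midpoint falls inside the split."""
    lo, hi = split
    return [rg for rg in row_groups if lo <= midpoint(rg) < hi]


def assign(row_groups: List[RowGroup], splits: List[Split]) -> Dict[Split, List[str]]:
    """Map every split to the names of the row groups it would read."""
    return {s: [rg.name for rg in filter_by_midpoint(row_groups, s)] for s in splits}


# Three equally sized splits covering the (made-up) file.
splits = [(0, 128), (128, 256), (256, 384)]

# Correct offsets: one row group per split.
correct = [RowGroup("rg0", 4, 120), RowGroup("rg1", 124, 120), RowGroup("rg2", 244, 120)]

# Same file, but rg1 reports a wrong start offset (made-up value).
skewed = [RowGroup("rg0", 4, 120), RowGroup("rg1", 4, 120), RowGroup("rg2", 244, 120)]

print(assign(correct, splits))
print(assign(skewed, splits))
```

With the skewed offset, the first split reads two row groups and the second reads none, yet no row group is dropped or read twice. This holds only as long as the computed midpoint still lands inside some split; the sketch does not model readers that seek directly to the reported offset.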
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407960#comment-17407960 ]

Gabor Szadovszky commented on SPARK-34276:
------------------------------------------

[~csun], any application using parquet-mr 1.12.0 is impacted by PARQUET-2078.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407737#comment-17407737 ]

Nemon Lou commented on SPARK-34276:
-----------------------------------

[~csun] yes, the same as PARQUET-2078.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407554#comment-17407554 ]

Chao Sun commented on SPARK-34276:
----------------------------------

[~smilegator] yeah, it seems like Spark will be affected. cc [~gszadovszky] to confirm.

[~nemon] is the issue you mentioned the same as PARQUET-2078?
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407203#comment-17407203 ]

Nemon Lou commented on SPARK-34276:
-----------------------------------

Spark also fails to read Parquet files when spark.sql.hive.convertMetastoreParquet=false is set.
This setting is true by default.
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405878#comment-17405878 ]

Xiao Li commented on SPARK-34276:
---------------------------------

https://issues.apache.org/jira/browse/PARQUET-2078

Do we have this problem?