Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
Thank YOU for your PR and open discussion on this, @seancxmao . Let's see
in other PRs.
---
-
To unsubscribe, e-mail:
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
Sure, close this PR. Thank you all for your time and insights.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
Could you close this PR and JIRA, @seancxmao ?
---
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
I agree that correctness is more important. If we should not make behaviors
consistent when doing the conversion, I will close this PR. @cloud-fan @gatorsmile
what do you think?
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
Compatibility is not a golden rule if it sacrifices correctness. A fast and
**wrong** result doesn't look like a benefit to me. Do you think the customer
wants to get a wrong result like Hive?
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
It keeps Hive compatibility but loses the performance benefit by setting
`spark.sql.hive.convertMetastoreParquet=false`. We can do better by enabling the
conversion and still keeping Hive
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
@seancxmao . For Hive compatibility,
`spark.sql.hive.convertMetastoreParquet=false` looks enough to me.
---
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
Could we see this as a behavior change? We can add a legacy conf (e.g.
`spark.sql.hive.legacy.convertMetastoreParquet`, maybe defined in HiveUtils)
to enable users to revert to the previous
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
Thank you for the pointer, @seancxmao . And thank you for clarification,
@cloud-fan .
It looks like we are somewhat re-creating a correctness issue in this PR when
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22343
To clarify: this is just a workaround for when we hit a problematic (having
case-insensitive duplicated field names in the parquet file) Hive parquet
table and we want to read it with the native
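
For readers following the thread, here is a minimal Python sketch of what "case-insensitive duplicated field names" means and why field resolution becomes ambiguous. It is only an illustration under stated assumptions, not Spark's implementation: the function names `find_case_insensitive_duplicates` and `resolve_field` are hypothetical, and raising an exception on ambiguity is just one of the behaviors discussed in this thread.

```python
from collections import defaultdict


def find_case_insensitive_duplicates(field_names):
    """Group field names that collide when compared case-insensitively.

    Hypothetical helper for illustration only; not part of Spark.
    """
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one spelling maps to the same key.
    return {key: names for key, names in groups.items() if len(names) > 1}


def resolve_field(field_names, requested):
    """Resolve `requested` case-insensitively, raising when it is ambiguous.

    Returns the matching field name, or None when nothing matches.
    """
    matches = [name for name in field_names if name.lower() == requested.lower()]
    if len(matches) > 1:
        raise ValueError(
            f"Found duplicate field(s) {requested!r}: {matches} in case-insensitive mode"
        )
    return matches[0] if matches else None
```

With field names such as `["id", "ID", "name"]`, looking up `name` is unambiguous, while looking up `id` matches two physical fields, which is the situation the comments here are debating how to handle.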
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
@dongjoon-hyun It is a little complicated. There has been a discussion
about this in #22184. Below are some key comments from @cloud-fan and
@gatorsmile, just FYI.
*
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
What I asked was the following, wasn't it?
> In case-insensitive mode, when converting hive parquet table to parquet
data source, we switch the duplicated fields resolution mode to ask
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
Hi, @dongjoon-hyun
When we find duplicated field names in the case of convertMetastoreXXX, we
have 2 options
(1) raise an exception, as the parquet data source does. To most end users, they do
not
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22343
Hi, @seancxmao . Should we be consistent? IIRC, all the previous PRs raise an
Exception to prevent any potential issues. In this case, I have a feeling that
`convertMetastoreXXX` should be used
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22343
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95864/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22343
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22343
**[Test build #95864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)**
for PR 22343 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22343
**[Test build #95864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)**
for PR 22343 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22343
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22343
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22343
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95857/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22343
**[Test build #95857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95857/testReport)**
for PR 22343 at commit
Github user seancxmao commented on the issue:
https://github.com/apache/spark/pull/22343
@dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and tried to use a
more complete and clearer title for this PR. What do you think?
---