[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-16 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
Thank YOU for your PR and open discussion on this, @seancxmao . Let's see 
in another PRs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-16 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Sure, close this PR. Thank you all for your time and insights.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
Could you close this PR and JIRA, @seancxmao ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-12 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
I agree that correctness is more important. If we should not make behaviors 
consistent when do the convertion, I will close this PR. @cloud-fan @gatorsmile 
what do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
Compatibility is not a gold rule if it sacrifices correctness. Fast and 
**wrong** result doesn't looks like benefits to me. Do you think the customer 
want to get a wrong result like Hive?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
It keeps Hive compatibility but loses performance benefit by setting 
spark.sql.hive.convertMetastoreParquet=false. We can do better by enabling the 
conversion and still keeping Hive compatibility. Though this makes our 
implementation more complex, I guess most end users may keep 
`spark.sql.hive.convertMetastoreParquet=true` and 
`spark.sql.caseSensitive=false` which are default values, this brings benefits 
to end users.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
@seancxmao . For Hive compatibility, 
`spark.sql.hive.convertMetastoreParquet=false` looks enough to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Could we see this as a behavior change? We can add a legacy conf (e.g. 
`spark.sql.hive.legacy.convertMetastoreParquet`, may be defined in HiveUtils) 
to enable users to revert back to the previous behavior for backward 
compatibility. If this legacy conf is set to true, behaviors will be reverted 
both in case-sensitive and case-insensitive mode.

|caseSensitive|legacy behavior|new behavior|
|-|-|-|
|true|convert anyway|skip conversion, log warning message|
|false|convert, fail if there's ambiguity|convert, first match if there's 
ambiguity|


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
Thank you for the pointer, @seancxmao . And thank you for clarification, 
@cloud-fan .

It looks like we are re-creating correctness issue somewhat in this PR when 
`caseSensitive=true`. 

**BEFORE THIS PR (master)**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY '/tmp/hive_t' STORED AS PARQUET 
SELECT 'A', 'a'")
scala> sql("CREATE TABLE hive_t(a STRING) STORED AS PARQUET LOCATION 
'/tmp/hive_t'")
scala> sql("CREATE TABLE spark_t(a STRING) USING PARQUET LOCATION 
'/tmp/hive_t'")
scala> sql("set spark.sql.caseSensitive=true")
scala> spark.table("hive_t").show
+---+
|  a|
+---+
|  a|
+---+

scala> spark.table("spark_t").show
+---+
|  a|
+---+
|  a|
+---+
```

**AFTER THIS PR**
```scala
scala> sql("set spark.sql.caseSensitive=true")
scala> spark.table("hive_t").show
+---+
|  a|
+---+
|  A|
+---+

scala> spark.table("spark_t").show
+---+
|  a|
+---+
|  a|
+---+
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22343
  
To clarify: this is just a workaround when we hit a problematic(having 
case-insensitive duplicated filed names in the parquet file) hive parquet 
tables and we want to read it with the native parquet reader. The hive behaivor 
is weird but we need to follow it as we are reading a hive table.

Personally I think it's not a big deal. If the hive table is malformed, I 
think we don't have to follow hive's bugy behavior. If people are confused by 
this patch and think this doesn't worth, I'm ok to just leave it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@dongjoon-hyun It is a little complicated. There has been a discussion 
about this in #22184. Below are some key comments from @cloud-fan and 
@gatorsmile, just FYI.

* https://github.com/apache/spark/pull/22184#discussion_r212834477
* https://github.com/apache/spark/pull/22184#issuecomment-416885728


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
What I asked was the following, wasn't it? 
> In case-insensitive mode, when converting hive parquet table to parquet 
data source, we switch the duplicated fields resolution mode to ask parquet 
data source to pick the first matched field - the same behavior as hive parquet 
table - to keep behaviors consistent.

Spark should not pick up the first matched field in any cases because it's 
considered as a correctness behavior in previous PR which is backported to 
`branch-2.3` https://github.com/apache/spark/pull/22183. I don't think we need 
to follow incorrect Hive behavior.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Hi, @dongjoon-hyun
When we find duplicated field names in the case of convertMetastoreXXX, we 
have 2 options
(1) raise exception as parquet data source. To most of end users, they do 
not know the difference between hive parquet table and parquet data source. If 
the conversion leads to different behaviors, they may be confused, and in some 
cases even lead to tricky data issues silently.
(2) Adjust behaviors of parquet data source to keep behaviors consistent. 
This seems more friendly to end users, and avoid any potential issues 
introduced by the conversion.

BTW, for parquet data source which is not converted from hive parquet 
table, we raise exception when there is ambiguity, sine this is more intuitive 
and reasonable.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22343
  
Hi, @seancxmao . Should we be consistent? IIRC, all the previous PR raises 
Exception to prevent any potential issues. In this case, I have a feeling that 
`convertMetastoreXXX` should be used to prevent the problem of Hive behavior by 
raising Exception, not hiding the problem of Hive behavior.
> In case-insensitive mode, when converting hive parquet table to parquet 
data source, we switch the duplicated fields resolution mode to ask parquet 
data source to pick the first matched field - the same behavior as hive parquet 
table - to keep behaviors consistent.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22343
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95864/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22343
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22343
  
**[Test build #95864 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)**
 for PR 22343 at commit 
[`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22343
  
**[Test build #95864 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)**
 for PR 22343 at commit 
[`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22343
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22343
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22343
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95857/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22343
  
**[Test build #95857 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95857/testReport)**
 for PR 22343 at commit 
[`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-09 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and try to use a 
more complete and clear title for this PR. What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org