[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-15 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 Thanks for all the feedback on this PR, folks. I'm going to close this PR/JIRA and open new ones for enabling configurable schema inference as a fallback. I'll ping each of you who has been active in

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-10 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16797 OK, if we think we should support tables created by Hive (or other systems) even when the data schema mismatches the table schema (and matches if lowercased), I'm OK with falling back to schema inference when

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 @mallman The Parquet schema merging methods take me back to #5214 :) I haven't been following changes here very closely but I would guess use of this method was replaced to the

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 BTW @budde, given that this represents a regression in behavior from previous versions of Spark, I think it is too generous of you to label the Jira issue as an "improvement" instead of a "bug". I

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 >> Like you said, users can still create a Hive table with mixed-case-schema Parquet/ORC files, by Hive or other systems like Presto. This table is readable for Hive, and for Spark prior to 2.1,

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 @cloud-fan: > Spark does support mixed-case-schema tables, and it always has. It's because we write the table schema to the metastore case-preserving, via table properties. Spark prior

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-08 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16797 @budde Spark does support mixed-case-schema tables, and it always has. It's because we write the table schema to the metastore case-preserving, via table properties. When we read a table, we get

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-08 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > For better user experience, we should automatically infer the schema and write it back to the metastore if there is no case-sensitive table schema in the metastore. This has the cost of detecting the need

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16797 For better user experience, we should automatically infer the schema and write it back to the metastore if there is no case-sensitive table schema in the metastore. This has the cost of detecting the

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > Is it entirely a compatibility issue? Seems like the only problem is when we write out mixed-case-schema Parquet files directly and create an external table pointing to these files with Spark

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16797 Is it entirely a compatibility issue? Seems like the only problem is when we write out mixed-case-schema Parquet files directly and create an external table pointing to these files with Spark
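
To make the scenario in this comment concrete, here is a minimal sketch of case-insensitive field resolution: a metastore that only holds lower-cased column names, matched against the mixed-case field names stored in a Parquet file. This is not the code from the PR. The function name `resolve_fields`, the sample column names, and the choice to return `None` for a column missing from the file are all assumptions made for illustration.

```python
def resolve_fields(metastore_columns, parquet_fields):
    """Map each metastore column to the Parquet field it matches, ignoring case.

    Returns a dict of metastore column -> Parquet field name, or None when the
    file has no such field. Raises ValueError when the match is ambiguous.
    """
    # Index the file's fields by their lower-cased form.
    by_lower = {}
    for field in parquet_fields:
        by_lower.setdefault(field.lower(), []).append(field)

    resolved = {}
    for column in metastore_columns:
        matches = by_lower.get(column.lower(), [])
        if len(matches) > 1:
            raise ValueError(
                f"column {column!r} is ambiguous, it matches {matches}"
            )
        # Assumption: a column absent from the file is reported as None.
        resolved[column] = matches[0] if matches else None
    return resolved


# Hypothetical names: the metastore holds lower-cased columns,
# while the Parquet file kept the original mixed case.
mapping = resolve_fields(["userid", "eventtime"], ["userId", "eventTime"])
print(mapping)
```

A case-sensitive lookup of `userid` in this file would find nothing, which is the failure mode the thread is discussing.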

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > Can we write such schema (conflicting columns after lower-casing) into metastore? I think the scenario here would be that the metastore contains a single lower-case column name that could

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16797 > BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical? Can we write such schema (conflicting columns after lower-casing) into metastore?
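
The conflict raised here can be checked mechanically. The sketch below groups field names by their lower-cased form and reports any group with more than one member. It only illustrates the question being asked and does not show how Spark behaves. The helper name `find_case_collisions` and the sample field names are hypothetical.

```python
from collections import defaultdict


def find_case_collisions(field_names):
    """Group field names that become identical once lower-cased."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the lower-cased keys that several original names map to.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical Parquet field names, used purely for illustration.
parquet_fields = ["userId", "userid", "eventTime"]
print(find_case_collisions(parquet_fields))
```

An empty result means every field can be stored under a lower-cased name without ambiguity.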

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical? I can take a look at how Spark handled this prior to 2.1, although I'm not sure if

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > How about we add a new SQL command to refresh the table schema in the metastore by inferring the schema from the data files? This is a compatibility issue and we should have provided a way for users to

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 The proposal to restore schema inference with finer grained control on when it is performed sounds reasonable to me. The case I'm most interested in is turning off schema inference entirely,
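
One way to picture the finer-grained control discussed in this part of the thread (infer and write back, infer without writing back, or never infer) is as a small decision function. The sketch below is an assumption-laden illustration, not Spark's implementation: the `InferenceMode` names and the `plan_schema_handling` helper are hypothetical, and the rule that a table with a stored case-sensitive schema never needs inference is taken from the comments above.

```python
from enum import Enum


class InferenceMode(Enum):
    # Hypothetical mode names chosen for this sketch.
    INFER_AND_SAVE = "infer_and_save"
    INFER_ONLY = "infer_only"
    NEVER_INFER = "never_infer"


def plan_schema_handling(mode, has_case_sensitive_schema):
    """Return (infer_from_files, write_back_to_metastore) for a table read."""
    # A table that already stores a case-sensitive schema needs no inference,
    # and NEVER_INFER switches inference off entirely.
    if has_case_sensitive_schema or mode is InferenceMode.NEVER_INFER:
        return (False, False)
    if mode is InferenceMode.INFER_ONLY:
        return (True, False)
    return (True, True)


for mode in InferenceMode:
    print(mode.name, plan_schema_handling(mode, has_case_sensitive_schema=False))
```

Keeping the decision in one place makes the cost trade-off explicit: only the modes that return `True` in the first position pay for reading file footers.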

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16797 If the use case suggested by @budde, where we want to infer the schema but not attempt to write it back as a property, makes sense, then the new SQL command approach might not work for it. But

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16797 How about we add a new SQL command to refresh the table schema in the metastore by inferring the schema from the data files? This is a compatibility issue and we should have provided a way for users to

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16797 > I'll double check, but I don't think spark.sql.hive.manageFilesourcePartitions=false would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema of

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > Should we roll these behaviors into one flag? e.g. ```spark.sql.hive.mixedCaseSchemaSupport``` That sounds reasonable to me. The only thing I wonder about is if there's any use case where

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread ericl
Github user ericl commented on the issue: https://github.com/apache/spark/pull/16797 > I'll double check, but I don't think spark.sql.hive.manageFilesourcePartitions=false would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema of

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 I'll double check, but I don't think ```spark.sql.hive.manageFilesourcePartitions=false``` would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-05 Thread ericl
Github user ericl commented on the issue: https://github.com/apache/spark/pull/16797 I agree that bringing back schema inference would be cleaner. One problem with doing something parquet-specific is that this would need to be repeated with each file format, e.g. orc, csv, json,

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-04 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 Bringing back schema inference is certainly a much cleaner option, although I imagine doing this in the old manner would negate the performance improvements brought by #14690 for any non-Spark 2.1

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72352/ Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Merged build finished. Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Merged build finished. Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72347/ Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72343/ Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Merged build finished. Test PASSed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16797 Shall we just infer the schema from files when accessing any Hive table that was not created by Spark, or was created by a version prior to 2.1.0, i.e., for which the schema has not been embedded as a

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16797 retest this please

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16797 We are waiting for https://github.com/apache/spark/pull/16799.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 Looks like SparkR unit tests have been failing for all or most PRs after [this commit.](https://github.com/apache/spark/commit/48aafeda7db879491ed36fff89d59ca7ec3136fa)

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 Relevant part of [Jenkins output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/console) for SparkR tests: ``` Error: processing vignette

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72326/ Test FAILed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16797 Merged build finished. Test FAILed.

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72326 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16797 **[Test build #72326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)** for PR 16797 at commit

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 Pinging @ericl, @cloud-fan and @davies, committers who have all reviewed or submitted changes related to this.