Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
Thanks for all the feedback on this PR, folks. I'm going to close this
PR/JIRA and open new ones for enabling configurable schema inference as a
fallback. I'll ping each of you who has been active in
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16797
OK. If we think we should support tables created by Hive (or other systems)
even when the data schema mismatches the table schema (and matches once
lower-cased), I'm OK with falling back to schema inference when
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
@mallman The Parquet schema merging methods take me back to #5214 :)
I haven't been following changes here very closely, but I would guess that use
of this method was replaced by the
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16797
BTW @budde, given that this represents a regression in behavior from
previous versions of Spark, I think it is too generous of you to label the Jira
issue as an "improvement" instead of a "bug". I
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16797
>> Like you said, users can still create a Hive table with
mixed-case-schema Parquet/ORC files, via Hive or other systems like Presto. This
table is readable by Hive, and by Spark prior to 2.1,
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
@cloud-fan:
> Spark does support mixed-case-schema tables, and it always has. This is
because we write the table schema to the metastore case-preserving, via table
properties.
Spark prior
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16797
@budde Spark does support mixed-case-schema tables, and it always has.
This is because we write the table schema to the metastore case-preserving, via
table properties. When we read a table, we get
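The case-preserving trick described here (persisting the exact schema as
metastore table properties) can be sketched in standalone Python. The property
key names (`schema.numParts`, `schema.part.N`) and the chunk size are
illustrative only, not Spark's actual property keys:

```python
import json

def schema_to_props(schema, chunk=4000):
    """Store a case-preserving schema JSON in metastore table properties
    by splitting it into bounded-size chunks (metastore property values
    have a length limit). Key names here are hypothetical."""
    text = json.dumps(schema)
    parts = [text[i:i + chunk] for i in range(0, len(text), chunk)]
    props = {"schema.numParts": str(len(parts))}
    for i, part in enumerate(parts):
        props[f"schema.part.{i}"] = part
    return props

def props_to_schema(props):
    """Reassemble the chunks and parse the original, case-preserved schema."""
    n = int(props["schema.numParts"])
    return json.loads("".join(props[f"schema.part.{i}"] for i in range(n)))
```

Because the schema round-trips through properties rather than through the
metastore's (lower-cased) column definitions, mixed-case field names survive
intact.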
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> For better user experience, we should automatically infer the schema and
write it back to the metastore if there is no case-sensitive table schema in
the metastore. This has the cost of detecting the need
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16797
For better user experience, we should automatically infer the schema and
write it back to the metastore if there is no case-sensitive table schema in
the metastore. This has the cost of detecting the
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> Is it entirely a compatibility issue? It seems like the only problem is
when we write out mixed-case-schema Parquet files directly and create an
external table pointing to these files with Spark
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16797
Is it entirely a compatibility issue? It seems like the only problem is
when we write out mixed-case-schema Parquet files directly and create an
external table pointing to these files with Spark
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> Can we write such schema (conflicting columns after lower-casing) into
metastore?
I think the scenario here would be that the metastore contains a single
lower-case column name that could
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16797
> BTW, what behavior do we expect if a parquet file has two columns whose
lower-cased names are identical?
Can we write such schema (conflicting columns after lower-casing) into
metastore?
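The collision @viirya is asking about (two columns whose lower-cased names are
identical) can be illustrated with a small standalone sketch; this is plain
Python for illustration, not Spark's actual resolution logic:

```python
from collections import defaultdict

def find_lowercase_collisions(columns):
    """Group column names by their lower-cased form and return only the
    groups containing more than one distinct original name."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

# A Parquet footer with two columns that differ only in case:
find_lowercase_collisions(["userId", "userid", "eventTime"])
# → {'userid': ['userId', 'userid']}
```

A table whose schema produces a non-empty result here cannot be represented by
a purely lower-cased metastore schema without ambiguity, which is why the
question matters for any inference fallback.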
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> BTW, what behavior do we expect if a parquet file has two columns whose
lower-cased names are identical?
I can take a look at how Spark handled this prior to 2.1, although I'm not
sure if
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> How about we add a new SQL command to refresh the table schema in the
metastore by inferring the schema from data files? This is a compatibility
issue, and we should have provided a way for users to
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16797
The proposal to restore schema inference with finer-grained control over when
it is performed sounds reasonable to me. The case I'm most interested in is
turning off schema inference entirely,
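The finer-grained control being discussed could look something like the
following sketch. The mode names and the `resolve_schema` helper are
hypothetical, written only to make the three options in the thread concrete
(infer and save back, infer every time, never infer):

```python
from enum import Enum

class SchemaInferenceMode(Enum):
    # Hypothetical mode names, for illustration only.
    INFER_AND_SAVE = "infer and write the inferred schema back to the metastore"
    INFER_ONLY = "infer on each table resolution, never write back"
    NEVER_INFER = "always trust the metastore schema as-is"

def resolve_schema(mode, metastore_schema, infer, save):
    """Pick the data schema for a table under the given mode.

    metastore_schema: case-preserving schema from table properties (None if absent)
    infer: zero-arg callable that infers a schema from the data files
    save: one-arg callable that persists an inferred schema to the metastore
    """
    if mode is SchemaInferenceMode.NEVER_INFER or metastore_schema is not None:
        return metastore_schema
    inferred = infer()
    if mode is SchemaInferenceMode.INFER_AND_SAVE:
        save(inferred)
    return inferred
```

An "infer only" mode covers the case @viirya raises below (inference without
write-back), while "never infer" covers turning inference off entirely.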
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16797
If the use case suggested by @budde, where we want to infer the schema but not
attempt to write it back as a table property, makes sense, then the new SQL
command approach might not work for it. But
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16797
How about we add a new SQL command to refresh the table schema in the metastore
by inferring the schema from data files? This is a compatibility issue, and we
should have provided a way for users to
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16797
> I'll double check, but I don't think
spark.sql.hive.manageFilesourcePartitions=false would solve this issue since
we're still deriving the file relation's dataSchema parameter from the schema
of
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
> Should we roll these behaviors into one flag? e.g.
```spark.sql.hive.mixedCaseSchemaSupport```
That sounds reasonable to me. The only thing I wonder about is if there's
any use case where
Github user ericl commented on the issue:
https://github.com/apache/spark/pull/16797
> I'll double check, but I don't think
spark.sql.hive.manageFilesourcePartitions=false would solve this issue since
we're still deriving the file relation's dataSchema parameter from the schema
of
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
I'll double check, but I don't think
```spark.sql.hive.manageFilesourcePartitions=false``` would solve this issue
since we're still deriving the file relation's dataSchema parameter from the
schema
Github user ericl commented on the issue:
https://github.com/apache/spark/pull/16797
I agree that bringing back schema inference would be cleaner. One problem
with doing something Parquet-specific is that it would need to be repeated
for each file format, e.g. ORC, CSV, JSON,
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
Bringing back schema inference is certainly a much cleaner option, although
I imagine doing this in the old manner would negate the performance
improvements brought by #14690 for any non-Spark 2.1
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72352/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72352 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)**
for PR 16797 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72347/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72347 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)**
for PR 16797 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72343/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72343 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)**
for PR 16797 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16797
Shall we just infer the schema from files when accessing any Hive table
that was not created by Spark, or was created by a Spark version prior to
2.1.0, i.e., for which the schema has not been embedded as a
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72352 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)**
for PR 16797 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72347 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)**
for PR 16797 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72343 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)**
for PR 16797 at commit
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16797
retest this please
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16797
We are waiting for https://github.com/apache/spark/pull/16799.
---
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
Looks like SparkR unit tests have been failing for all or most PRs after
[this
commit](https://github.com/apache/spark/commit/48aafeda7db879491ed36fff89d59ca7ec3136fa).
---
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
Relevant part of [Jenkins
output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/console)
for SparkR tests:
```
Error: processing vignette
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72326/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16797
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72326 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)**
for PR 16797 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16797
**[Test build #72326 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)**
for PR 16797 at commit
Github user budde commented on the issue:
https://github.com/apache/spark/pull/16797
Pinging @ericl, @cloud-fan and @davies, committers who have all reviewed or
submitted changes related to this.
---