[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-15 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
Thanks for all the feedback on this PR, folks. I'm going to close this 
PR/JIRA and open new ones for enabling configurable schema inference as a 
fallback. I'll ping each of you who has been active in this discussion on the 
new PR once it is open.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16797
  
OK, if we think we should support tables created by Hive (or other systems) 
even when the data schema mismatches the table schema (and matches if lowercased), 
I'm OK with falling back to schema inference when reading.

> Incidentally, what happens when a program outside of Spark (such as Hive) 
updates the Hive metastore schema of a table with the embedded Spark SQL schema?

I think we will have a bug. But this is not a regression, right?





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
@mallman The Parquet schema merging methods take me back to #5214 :)

I haven't been following changes here very closely, but I would guess the use of 
this method was replaced by the previously-used call to 
```ParquetFileFormat.inferSchema()```. But I think it is important to point out 
that this functionality was explicitly added to support case sensitivity 
differences.

In regard to the JIRA, I'll either modify it or replace it with a new one 
for bringing back (configurable) inference. I can mark it a "bug" at that point.







[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
BTW @budde, given that this represents a regression in behavior from 
previous versions of Spark, I think it is too generous of you to label the Jira 
issue as an "improvement" instead of a "bug". I would support you changing the 
type to "bug" if you want to.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
>> Like you said, users can still create a hive table with 
mixed-case-schema parquet/orc files, by hive or other systems like presto. This 
table is readable for hive, and for Spark prior to 2.1, because of the runtime 
schema inference. But this is not intentional, and Spark should not support it 
as the data file schema and table schema mismatch.
>
> I will continue to argue strongly against reducing the number of usecases 
Spark SQL supports out of the box. While offering a migration command can offer 
a helpful optimization I don't think it is acceptable as the only option for 
the reasons I've detailed here.
>
> Simply put, I think relying on the presence of Spark-specific key/value 
pairs in the table properties in order for Spark SQL to function properly and 
assuming that Spark (or Spark users) can easily alter those properties to add 
the table schema is too brittle for large-scale production use.

I would have to agree with @budde in this case. In versions of Spark prior 
to 2.1, an effort was made to reconcile metastore and file format case 
mismatching using the method `ParquetFileFormat.mergeMetastoreParquetSchema`. 
The code docs for that method state as much here: 
https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719.
 I don't see anything here that suggests this was a "hack" or was intended to 
be removed in a later version. It seems we've simply broken compatibility with 
a certain class of Hive tables in Spark 2.1.
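
For readers following along, here is a rough Python sketch of the reconciliation idea behind that method. This is not Spark's actual Scala implementation; the function and parameter names are illustrative. The idea is to keep the metastore's view of the table but recover the case-sensitive field names from the Parquet file schema:

```python
# Illustrative sketch of the idea behind
# ParquetFileFormat.mergeMetastoreParquetSchema -- NOT Spark's actual code.

def merge_metastore_parquet_schema(metastore_fields, parquet_fields):
    """Map each (lower-cased) metastore field name to the case-preserving
    name found in the Parquet file schema, keeping the metastore name
    when the file has no matching field."""
    by_lower = {}
    for name in parquet_fields:
        # First occurrence wins; duplicate lower-cased names are an edge
        # case discussed elsewhere in this thread.
        by_lower.setdefault(name.lower(), name)
    return [by_lower.get(m.lower(), m) for m in metastore_fields]


print(merge_metastore_parquet_schema(
    ["userid", "eventtime"], ["userId", "eventTime"]))
# → ['userId', 'eventTime']
```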

Schema inference is very expensive, and doing it at query time on large 
tables was painful in versions prior to Spark 2.1 because all metadata files 
were read. But it seems some people were using it nonetheless and found it 
useful. At least in Spark 2.1, only the files for partitions read in a query 
will be read for schema inference. That would significantly enhance the schema 
inference performance at query time for partitioned tables.

Incidentally, what happens when a program outside of Spark (such as Hive) 
updates the Hive metastore schema of a table with the embedded Spark SQL 
schema? Does Spark detect that change and update the embedded schema? Does it 
have to redo the schema inference across all files in the table?





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
@cloud-fan:

> Spark does support mixed-case-schema tables, and it has always been. It's 
because we write table schema to metastore case-preserving, via table 
properties.

Spark prior to 2.1 supported *any* case-sensitive table, regardless of what 
table properties are set. Spark 2.1 supports these tables if and only if Spark 
2.1 was used to create them and embedded the schema as a metadata property.

> So the data file schema must match the table schema, or Spark will fail, 
it has always been.

This is absolutely not how it's always been. Spark would infer the schema from 
the source files and use that schema when constructing a logical relation. 
We've been relying on this behavior for years.

> For any table, as long as hive can read it, Spark can read it.

I've double checked this and Hive can query against tables backed by 
case-sensitive Parquet files. Spark 2.1 is currently the only Hive-compatible 
query engine I'm familiar with that won't support this usecase.

> But we supported it, with the cost of runtime schema inference.

My argument is that it should be possible to fall back to this level of 
support if the properties aren't present.

> This problem was solved in Spark 2.1, by writing table schema to 
metastore case-preserving for hive serde tables. Now we can say that, the data 
schema must match the table schema, or Spark should fail.

Spark does not explicitly fail in this case. It falls back to the downcased 
metastore schema, which will silently fail and return 0 results if a 
case-sensitive field name is used in your projection or filter predicate.
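
A toy illustration of that failure mode, with plain Python standing in for a query engine (everything here is illustrative, not Spark code):

```python
# Toy illustration of the silent-zero-results failure: the metastore reports a
# lower-cased column, the data files carry the case-sensitive one, and a
# projection on the lower-cased name silently matches nothing.

rows = [{"userId": 1}, {"userId": 2}]   # data as written (case-sensitive)
downcased_schema = ["userid"]           # schema as read back from the metastore

def project(rows, columns):
    """Keep only columns that actually exist in each row -- a stand-in for a
    case-sensitive reader resolving names against the file schema."""
    return [{c: r[c] for c in columns if c in r} for r in rows]

print(project(rows, downcased_schema))  # → [{}, {}] -- the column vanishes
print(project(rows, ["userId"]))        # → [{'userId': 1}, {'userId': 2}]
```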

> That's why I prefer the migration command approach, it keeps the concept 
clean: data schema must match table schema.

This links Spark upgrades to potentially-costly data migrations. From an 
end-user perspective, prior to 2.1 you could simply point Spark SQL to an 
external Hive metastore and query any data in it. Now you have to make sure the 
table has been migrated to the appropriate version of Spark or your queries may 
silently return incorrect results.

The migration approach also assumes that Spark has write access to the Hive 
metastore it is querying against. If you have read-only access to a metastore 
administered by another team or organization you are at their mercy to run 
migrations on your behalf against the latest version of Spark in order to allow 
you to query their tables from Spark. I think anybody who's ever found 
themselves in a similar situation can attest that it's never good to be 
beholden to someone else to enable a feature that only matters to you.

And again, in some cases migrating all tables in a large Hive warehouse 
could be an extremely expensive operation that potentially touches petabytes of 
data.

> Like you said, users can still create a hive table with mixed-case-schema 
parquet/orc files, by hive or other systems like presto. This table is readable 
for hive, and for Spark prior to 2.1, because of the runtime schema inference 
But this is not intentional, and Spark should not support it as the data file 
schema and table schema mismatch. We can make the migration command cover this 
case too.

I will continue to argue strongly against reducing the number of usecases 
Spark SQL supports out of the box. While offering a migration command can offer 
a helpful optimization I don't think it is acceptable as the only option for 
the reasons I've detailed here.

Simply put, I think relying on the presence of Spark-specific key/value 
pairs in the table properties in order for Spark SQL to function properly and 
assuming that Spark (or Spark users) can easily alter those properties to add 
the table schema is too brittle for large-scale production use.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16797
  
@budde Spark does support mixed-case-schema tables, and it has always been. 
It's because we write table schema to metastore case-preserving, via table 
properties. When we read a table, we get schema from metastore and assume it's 
the schema of the table data files. So the data file schema must match the 
table schema, or Spark will fail, it has always been.

However, there is one exception. There are 2 kinds of tables in Spark: data 
source tables and hive serde tables (we have different SQL syntax to create 
them). Data source tables are totally managed by Spark; we read/write data 
files directly and only use the hive metastore as a persistence layer, which means 
data source tables are not compatible with hive: hive can't read/write them.

For hive serde tables, it should be compatible with hive and we use hive 
api to read/write it. For any table, as long as hive can read it, Spark can 
read it. However, the exception is, for parquet and orc formats, we will read 
data files directly, as an optimization (reading using hive api is slow). Before 
Spark 2.1, we save schema to hive metastore directly, which means schema will 
be lowercased. w.r.t. this, ideally we should not support mixed-case-schema 
parquet/orc data files for this kind of table, or the data schema will mismatch 
the table schema. But we supported it, with the cost of runtime schema 
inference.

This problem was solved in Spark 2.1, by writing table schema to metastore 
case-preserving for hive serde tables. Now we can say that, the data schema 
must match the table schema, or Spark should fail.

Then comes this problem: for parquet/orc format hive serde tables 
created by Spark prior to 2.1, the data file schema may not match the table 
schema, but we need to still support it for compatibility.

That's why I prefer the migration command approach, it keeps the concept 
clean: data schema must match table schema.

Like you said, users can still create a hive table with mixed-case-schema 
parquet/orc files, by hive or other systems like presto. This table is readable 
for hive, and for Spark prior to 2.1, because of the runtime schema inference. 
But this is not intentional, and Spark should not support it as the data file 
schema and table schema mismatch. We can make the migration command cover this 
case too.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-08 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> For better user experience, we should automatically infer the schema and 
write it back to metastore, if there is no case-sensitive table schema in 
metastore. This has the cost of detecting the need for schema inference, and 
complicating table read code path.

Totally agree. I think the default behavior should be to infer and backfill 
a case-sensitive schema into the table properties if one isn't already there. 
An option should also be provided to disable all inference and just fall back 
to the case-insensitive metastore schema if none is found (i.e. the current 
behavior in 2.1.0).
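
A minimal Python sketch of that fallback policy. The mode names, parameters, and callables below are illustrative assumptions, not an actual Spark API:

```python
# Hypothetical sketch of the proposed read-path policy for Hive serde tables;
# mode names and helper callables are illustrative, not Spark's API.

def resolve_table_schema(embedded_schema, metastore_schema,
                         infer_from_files, write_back,
                         mode="INFER_AND_SAVE"):
    """Pick the schema to use when reading a table."""
    if embedded_schema is not None:
        return embedded_schema       # case-preserving schema already stored
    if mode == "NEVER_INFER":
        return metastore_schema      # current 2.1.0 behavior: lower-cased names
    inferred = infer_from_files()    # potentially expensive inference
    if mode == "INFER_AND_SAVE":
        write_back(inferred)         # back-fill table properties once
    return inferred
```

For example, a table with no embedded schema would be inferred and backfilled on first read, while `NEVER_INFER` reproduces the 2.1.0 behavior of trusting the downcased metastore schema.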

> If this is only a compatibility issue, I think it's fair to ask the 
cluster maintainers to run some commands after upgrading the Spark cluster. Even if 
there are a lot of tables, it's easy to write a script to automate it.

I don't think this is fair. For one, as I've mentioned, in some cases Spark 
may not be the tool being used to maintain the metastore. This will now require 
the warehouse admins to set up a Spark cluster and run these migration commands 
on every table with case-sensitive underlying data if they'd like them to be 
accessible from Spark. As a second point, while writing an automation script 
may be trivial the execution costs aren't, especially if the data is stored in 
a format like JSON where each and every record in the table must be read in 
order to infer the schema.

> If there is no Spark specific table properties, we assume this table is 
created by hive (not by external systems like Presto), so the schema of parquet 
files should be all lowercased.

This isn't an assumption made by Spark prior to 2.1.0, whether this was an 
explicit decision or not. All I'm asking for is a way to configure Spark to 
continue supporting a use case it has supported for years.

Also, in our case, the table was created by Spark, not Presto. Presto is 
just an example of another execution engine we've put in front of our warehouse 
that hasn't had a problem with the underlying Parquet data being 
case-sensitive. We just used an older version of Spark to create the tables. I 
would think long and hard about whether requiring warehouse admins to run 
potentially-costly migrations between Spark versions to update table metadata 
is a preferable option to offering a way for being backwards-compatible with 
the old behavior.

Again, I think introducing a mechanism to migrate the table properties is a 
good idea. I just don't think it should be the only option.

> Another proposal is to make parquet reader case-insensitive, so that we 
can solve this problem without schema inference. But the problem is, Spark can 
be configured to be case-sensitive, so that it's possible to write such a 
schema (conflicting columns after lower-casing) into metastore. I think this 
proposal is the best if we can totally make Spark case-insensitive.

I don't think this would be a bad option if this could be enabled at the 
Parquet level, but it seems their work towards enabling case-insensitive 
file access has stalled. As @ericl pointed out above, moving this to the 
ParquetReadSupport level may make the situation better for Parquet but the 
behavior won't be consistent across file formats like ORC or JSON.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16797
  
For better user experience, we should automatically infer the schema and 
write it back to metastore, if there is no case-sensitive table schema in 
metastore. This has the cost of detecting the need for schema inference, and 
complicating table read code path.

If this is only a compatibility issue, I think it's fair to ask the cluster 
maintainers to run some commands after upgrading the Spark cluster. Even if there are a 
lot of tables, it's easy to write a script to automate it.

> I wouldn't make the assumption that Spark is being used to create and 
maintain the tables in the Hive Metastore that Spark is querying against. We're 
currently using Spark to add and update metastore tables in our usecase, but I 
don't think Spark should make any assumptions about how the table was created 
or what properties may be set.

Actually we do have this assumption... When we create a table with Spark, 
we will put a lot of Spark specific table properties. When we read a table, if 
the Spark specific table properties are there, we will treat it differently. If 
there is no Spark specific table properties, we assume this table is created by 
hive (not by external systems like Presto), so the schema of parquet files 
should be all lowercased.
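
That assumption could be sketched like this. Treat the whole check, including the property-key prefix, as an illustrative sketch rather than Spark's exact detection logic:

```python
# Sketch of the table-read assumption described above: a table carrying no
# Spark-specific properties is treated as Hive-created, so its data file
# schema is assumed to be all lower-cased. The "spark.sql." prefix check
# here is an illustrative stand-in for the real property detection.

def assume_lowercase_data_schema(table_properties):
    """True when the table looks Hive-created (no Spark-specific
    table properties)."""
    return not any(k.startswith("spark.sql.") for k in table_properties)
```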


Another proposal is to make parquet reader case-insensitive, so that we can 
solve this problem without schema inference. But the problem is, Spark can be 
configured to be case-sensitive, so that it's possible to write such a schema 
(conflicting columns after lower-casing) into metastore. I think this proposal 
is the best if we can totally make Spark case-insensitive.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> is it purely a compatibility issue? Seems like the only problem is, 
when we write out mixed-case-schema parquet files directly, and create an 
external table pointing to these files with Spark prior to 2.1, then read this 
table with Spark 2.1+.

Fundamentally, I wouldn't make the assumption that Spark is being used to 
create and maintain the tables in the Hive Metastore that Spark is querying 
against. We're currently using Spark to add and update metastore tables in our 
usecase, but I don't think Spark should make any assumptions about how the 
table was created or what properties may be set.

In regard to the underlying issue, we've been using Spark in production for 
over two years and have several petabytes of case-sensitive Parquet data we've 
both written and queried using Spark. As of Spark 2.1, we are no longer able to 
use Spark to query any of this data as any query containing a case-sensitive 
field name will return 0 results. I would argue this is a compatibility 
regression.

> For tables in Hive, as long as Hive can read it, Spark should be 
able to read it too.

In our case, other Hive-compatible query engines like Presto don't have a 
problem with case-sensitive Parquet files. I haven't tried Hive itself in a 
long time but as far as I remember we didn't have a problem there either.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16797
  
is it purely a compatibility issue? Seems like the only problem is, 
when we write out mixed-case-schema parquet files directly, and create an 
external table pointing to these files with Spark prior to 2.1, then read this 
table with Spark 2.1+.

For tables in Hive, as long as Hive can read it, Spark should be able 
to read it too.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> Can we write such schema (conflicting columns after lower-casing) into 
metastore?

I think the scenario here would be that the metastore contains a single 
lower-case column name that could resolve to multiple case-sensitive column 
names in the underlying Parquet file. This could've happened via the user 
manually executing a ```CREATE TABLE ...``` query with an explicit schema. Since 
the metastore itself isn't really defining expected behavior in this case I 
think we can just consider this undefined behavior and return the first field 
that matches alphabetically.

I don't think this is very likely to be a legitimate usecase, but it's good 
to point out edge cases.
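
The "first field alphabetically" convention could look like this. It is purely a sketch of the suggested undefined-behavior tie-break, not existing Spark code:

```python
# Sketch of the suggested tie-break for ambiguous lower-cased names: among
# Parquet fields that collapse to the same metastore column name, pick the
# alphabetically first match. Purely illustrative.

def resolve_field(metastore_name, parquet_fields):
    matches = sorted(f for f in parquet_fields
                     if f.lower() == metastore_name.lower())
    return matches[0] if matches else None
```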





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-07 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16797
  
> BTW, what behavior do we expect if a parquet file has two columns whose 
lower-cased names are identical?

Can we write such schema (conflicting columns after lower-casing) into 
metastore?






[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> BTW, what behavior do we expect if a parquet file has two columns whose 
lower-cased names are identical?

I can take a look at how Spark handled this prior to 2.1, although I'm not 
sure if the behavior we'll see there was the result of a conscious decision or 
"undefined" behavior.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> how about we add a new SQL command to refresh the table schema in 
metastore by inferring schema with data files? This is a compatibility issue 
and we should have provided a way for users to migrate, before the 2.1 release. 
I think this approach is much simpler than adding a flag.

While I think introducing a command for inferring and storing the table's 
case-sensitive schema as a property would be a welcome addition, I think 
requiring this property to be there in order for Spark SQL to function properly 
with case-sensitive data files could really restrict the settings Spark SQL can 
be used in.

If a user wanted to use Spark SQL to query over an existing warehouse 
containing hundreds or even thousands of tables, under the suggested approach a 
Spark job would have to be run to infer the schema of each and every table. 
While file formats such as Parquet store their schemas as metadata, there still could 
potentially be millions of files to inspect for the warehouse. A less amenable 
format like JSON might require scanning all the data in the warehouse.

This also doesn't cover the use case @viirya pointed out where the user may 
not have write access to the metastore they are querying against. In this case, 
the user would have to rely on the warehouse administrator to create the Spark 
schema property for every table they wish to query.

> For tables created by hive, as hive is a case-insensitive system, will 
the parquet files have mixed-case schema?

I think the Hive Metastore has become a bit of an open standard for 
maintaining a data warehouse catalog since so many tools integrate with it. I 
wouldn't assume that the underlying data pointed to by an external metastore 
was created or managed by Hive itself. For example, we maintain a Hive 
Metastore that catalogs case-sensitive files written by our Spark-based ETL 
pipeline, which parses case classes from string data and writes them as Parquet.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
The proposal to restore schema inference with finer grained control on when 
it is performed sounds reasonable to me. The case I'm most interested in is 
turning off schema inference entirely, because we do not use parquet files with 
upper-case characters in their column names.

BTW, what behavior do we expect if a parquet file has two columns whose 
lower-cased names are identical?





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16797
  
If the use case @budde suggested, where we want to infer the schema but not 
attempt to write it back as a property, makes sense, then the new SQL command 
approach might not work for it. But in that case, it is still necessary to ask 
someone who is permitted to migrate and update the table property.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16797
  
how about we add a new SQL command to refresh the table schema in metastore 
by inferring schema with data files? This is a compatibility issue and we 
should have provided a way for users to migrate, before the 2.1 release. I 
think this approach is much simpler than adding a flag.

For tables created by hive, as hive is a case-insensitive system, will the 
parquet files have mixed-case schema?





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16797
  
> I'll double check, but I don't think 
spark.sql.hive.manageFilesourcePartitions=false would solve this issue since 
we're still deriving the file relation's dataSchema parameter from the schema 
of MetastoreRelation. The call to fileFormat.inferSchema() has been removed 
entirely.

I think that is correct. `spark.sql.hive.manageFilesourcePartitions` won't 
control whether to infer the schema or not.

The approach @ericl suggested (`spark.sql.hive.mixedCaseSchemaSupport`) 
sounds good to me.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
> Should we roll these behaviors into one flag? e.g. 
```spark.sql.hive.mixedCaseSchemaSupport```

That sounds reasonable to me. The only thing I wonder about is if there's 
any use case where we want to infer the schema but not attempt to write it back 
as a property, say if the external metastore doesn't permit table property 
updates from the user. We can always just log the failure, but this could be 
noisy for users expecting this behavior by default. This could be solved by 
adding an INFER_WITHOUT_SAVING mode.

I'll leave the PR open for now so we can hear and discuss @mallman's input 
but if we're all on board with this approach I'll eventually close this out 
in favor of a new PR adding configurable schema inference behavior.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/16797
  
> I'll double check, but I don't think 
spark.sql.hive.manageFilesourcePartitions=false would solve this issue since 
we're still deriving the file relation's dataSchema parameter from the schema 
of MetastoreRelation. The call to fileFormat.inferSchema() has been removed 
entirely.

Makes sense. I guess that would have to be restored as an option.

> If Spark SQL is set on using a table property to store the case-sensitive 
schema then I think having a way to backfill this property for existing < 2.1 
tables as well as tables not created or managed by Spark will be a necessity. 
If the cleanest way to deal with this case sensitivity problem is to bring back 
schema inference then I think a good option would be to introduce a 
configuration param to indicate whether or not an inferred schema should be 
written back to the table as a property.

Yeah, the table property was added in 2.1 (see 
HiveExternalCatalog:getSchemaFromTableProperties).

> We could also introduce another config param that allows a user to bypass 
schema inference even if a case-sensitive schema can't be read from the table 
properties. This could be helpful for users who would like to query external 
Hive tables that aren't managed by Spark and that they know aren't backed by 
files containing case-sensitive field names.
> 
> This would basically allow us to support the following use cases:
> 
> The MetastoreRelation is able to read a case-sensitive schema from the 
table properties. No inference is necessary.
> The MetastoreRelation can't read a case-sensitive schema from the table 
properties. A case-sensitive schema is inferred and, if configured, written 
back as a table property.
> The MetastoreRelation can't read a case-sensitive schema from the table 
properties. The user knows the underlying data files don't contain 
case-sensitive field names and has explicitly set a config param to skip the 
inference step.

Should we roll these behaviors into one flag?
e.g. `spark.sql.mixedCaseSchemaSupport` could have a few modes: 
INFER_IF_NEEDED (default), NEVER_INFER (the third use case above), and 
FAIL_FAST (instead of falling back to inference, throw an exception).
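To make the proposal concrete, here is a toy model (plain Python, not Spark code) of how the proposed modes might choose where the data schema comes from. The mode names are the ones floated in this comment, and the table property key is the one Spark 2.1's `HiveExternalCatalog` uses to embed its schema; everything else is illustrative:

```python
# Toy model of the proposed flag's decision logic. Mode names come from the
# discussion above; this is an illustration, not actual Spark code.
INFER_IF_NEEDED = "INFER_IF_NEEDED"  # fall back to inference when no schema property
NEVER_INFER = "NEVER_INFER"          # always trust the (lowercased) metastore schema
FAIL_FAST = "FAIL_FAST"              # raise instead of silently inferring

def resolve_schema(table_props, mode=INFER_IF_NEEDED):
    """Return 'property', 'inferred', or 'metastore', naming the schema source
    a reader would use for a table with the given metastore properties."""
    if "spark.sql.sources.schema.numParts" in table_props:
        # Case 1: a case-sensitive schema was embedded by Spark >= 2.1.
        return "property"
    if mode == NEVER_INFER:
        # Case 3: the user asserts the files have no mixed-case fields.
        return "metastore"
    if mode == FAIL_FAST:
        raise ValueError("no case-sensitive schema property and inference disabled")
    # Case 2 (default): fall back to inferring from the data files.
    return "inferred"

print(resolve_schema({"spark.sql.sources.schema.numParts": "1"}))  # property
print(resolve_schema({}))                                          # inferred
print(resolve_schema({}, mode=NEVER_INFER))                        # metastore
```

An INFER_WITHOUT_SAVING variant, as mentioned earlier in the thread, would only change what happens after the "inferred" branch, not this selection logic.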


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
I'll double check, but I don't think 
```spark.sql.hive.manageFilesourcePartitions=false``` would solve this issue 
since we're still deriving the file relation's dataSchema parameter from the 
schema of MetastoreRelation. The call to ```fileFormat.inferSchema()``` has 
been removed entirely.

If Spark SQL is set on using a table property to store the case-sensitive 
schema then I think having a way to backfill this property for existing < 2.1 
tables as well as tables not created or managed by Spark will be a necessity. 
If the cleanest way to deal with this case sensitivity problem is to bring back 
schema inference then I think a good option would be to introduce a 
configuration param to indicate whether or not an inferred schema should be 
written back to the table as a property.

We could also introduce another config param that allows a user to bypass 
schema inference even if a case-sensitive schema can't be read from the table 
properties. This could be helpful for users who would like to query external 
Hive tables that aren't managed by Spark and that they know aren't backed by 
files containing case-sensitive field names.

This would basically allow us to support the following use cases:

1) The MetastoreRelation is able to read a case-sensitive schema from the 
table properties. No inference is necessary.
2) The MetastoreRelation can't read a case-sensitive schema from the table 
properties. A case-sensitive schema is inferred and, if configured, written 
back as a table property.
3) The MetastoreRelation can't read a case-sensitive schema from the table 
properties. The user knows the underlying data files don't contain 
case-sensitive field names and has explicitly set a config param to skip the 
inference step.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-05 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/16797
  
I agree that bringing back schema inference would be cleaner. One problem 
with doing something Parquet-specific is that this would need to be repeated 
with each file format, e.g. ORC, CSV, JSON, unless we made case sensitivity an 
API option as noted above.

Per @viirya 's suggestion, it seems ideal to only do schema inference for 
older tables, and provide some command to update the table schema to preserve 
the case. This way we don't regress performance on 2.1 datasets or have to 
worry about schemas changing during optimization.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-04 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
Bringing back schema inference is certainly a much cleaner option, although 
I imagine doing this in the old manner would negate the performance 
improvements brought by #14690 for any non-Spark 2.1 dataset.

Ideally, I think we would infer the schema only from the pruned partition 
list for tables we can't read a case-sensitive schema for. Unless I'm mistaken, 
this would have to happen during optimization of the logical plan, after the 
PruneFileSourcePartitions rule has been applied. My thought is that we could 
write a rule that passes the pruned file list to the file format's 
inferSchema() method to replace the HadoopFsRelation's dataSchema with the 
result. I'm not very familiar with Catalyst though, so I'm not sure if changing 
the relation's schema during optimization will cause problems.
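As a sketch of that idea (plain Python with hypothetical helpers, not Spark code; the real implementation would go through the file format's `inferSchema()` on the pruned file list):

```python
# Toy illustration of inferring a merged schema from only the files that
# survive partition pruning, rather than from every file in the table.
# `read_footer` stands in for reading a Parquet footer; schemas are modeled
# as plain dicts of field name -> type.
def infer_schema(files, read_footer):
    """Merge per-file schemas from a (possibly pruned) file list."""
    merged = {}
    for f in files:
        for field, ftype in read_footer(f).items():
            if field in merged and merged[field] != ftype:
                raise ValueError(f"conflicting types for field {field!r}")
            merged[field] = ftype
    return merged

# Hypothetical footers for a two-partition table with mixed-case field names.
footers = {
    "part=1/a.parquet": {"userId": "bigint", "eventName": "string"},
    "part=2/b.parquet": {"userId": "bigint"},
}
pruned = ["part=2/b.parquet"]  # files left after PruneFileSourcePartitions
print(infer_schema(pruned, footers.get))  # {'userId': 'bigint'}
```

The cost of inference then scales with the pruned file list instead of the whole table, which is what would preserve most of the benefit of #14690.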

There is [an open PR to add support for case-insensitive schemas to 
Parquet](https://github.com/apache/parquet-mr/pull/210) which would be helpful 
here since it would provide a way to avoid schema inference when your Parquet 
files have case-sensitive fields but you don't care about case sensitivity when 
querying. Unfortunately the PR seems to be more or less abandoned though.

Pinging @mallman, the author of #14690, to see if he has any input on this.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72352/
Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72352 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)**
 for PR 16797 at commit 
[`fd4c444`](https://github.com/apache/spark/commit/fd4c4440e34fedf1a5f08850da8b203c71d9071c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72347/
Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72347 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)**
 for PR 16797 at commit 
[`a993928`](https://github.com/apache/spark/commit/a993928baba66e6ab6e592094012acdb4ce428ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72343/
Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72343 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)**
 for PR 16797 at commit 
[`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16797
  
Shall we just infer the schema from files whenever we access a Hive table that 
was not created by Spark, or was created by a Spark version prior to 2.1.0, 
i.e., for which the schema has not been embedded as a table property?
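That criterion is cheap to check, since Spark 2.1 embeds the case-sensitive schema in table properties split into parts (`spark.sql.sources.schema.numParts` plus `.part.N` keys). A minimal sketch in plain Python, with hypothetical sample properties:

```python
# Sketch of the fallback criterion (not Spark code): a table without the
# schema parts keys was created by Hive/another system or by Spark < 2.1,
# so the reader would fall back to inferring the schema from data files.
def should_infer_schema(props):
    return "spark.sql.sources.schema.numParts" not in props

hive_table = {"transient_lastDdlTime": "1486000000"}  # hypothetical Hive props
spark21_table = {
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": '{"type":"struct","fields":[]}',
}
print(should_infer_schema(hive_table))     # True: fall back to inference
print(should_infer_schema(spark21_table))  # False: use the embedded schema
```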


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72352 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)**
 for PR 16797 at commit 
[`fd4c444`](https://github.com/apache/spark/commit/fd4c4440e34fedf1a5f08850da8b203c71d9071c).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72347 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)**
 for PR 16797 at commit 
[`a993928`](https://github.com/apache/spark/commit/a993928baba66e6ab6e592094012acdb4ce428ff).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72343 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)**
 for PR 16797 at commit 
[`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16797
  
retest this please


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16797
  
We are waiting for https://github.com/apache/spark/pull/16799. 


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
Looks like SparkR unit tests have been failing for all or most PRs after 
[this commit](https://github.com/apache/spark/commit/48aafeda7db879491ed36fff89d59ca7ec3136fa).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
Relevant part of [Jenkins 
output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/console)
 for SparkR tests:

```
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
error in evaluating the argument 'object' in selecting a method for 
function 'summary': Error: object 'kmeansModel' not found
```

Doesn't appear to be related to this change. I'll investigate and see if I 
can reproduce it locally.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72326/
Test FAILed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16797
  
Merged build finished. Test FAILed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72326 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)**
 for PR 16797 at commit 
[`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16797
  
**[Test build #72326 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)**
 for PR 16797 at commit 
[`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-03 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/16797
  
Pinging @ericl, @cloud-fan and @davies, committers who have all reviewed or 
submitted changes related to this.

