Github user wgtmac commented on the issue:
https://github.com/apache/spark/pull/15035
Just confirmed that this also doesn't work with the vectorized reader. What I
did was as follows:
1. Created a flat Hive table with schema "name: String, id: Long". But the
parquet file which con
Github user sameeragarwal commented on the issue:
https://github.com/apache/spark/pull/15035
For our vectorized parquet reader, we try to take care of these type
conversions here:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasou
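The widening the vectorized reader performs can be sketched in miniature. This is a hypothetical simplification (the class and method names below are illustrative, not Spark's actual `ColumnVector` code): the point is that accepting an `int` into a long-backed vector widens each value once at load time.

```java
// Minimal sketch (hypothetical names, not Spark's actual vectorized-reader
// code): widen physical INT32 values while loading them into a long-backed
// column vector that matches a LongType catalyst schema.
public class WideningVectorSketch {
    // Simplified stand-in for a long-typed column vector.
    static final class LongColumnVector {
        private final long[] data;
        LongColumnVector(int capacity) { data = new long[capacity]; }
        // Accepting an int here performs the int -> long widening exactly
        // once, at load time, rather than per record during row access.
        void putInt(int rowId, int value) { data[rowId] = value; }
        long getLong(int rowId) { return data[rowId]; }
    }

    public static void main(String[] args) {
        int[] parquetPage = {1, -7, Integer.MAX_VALUE}; // physical INT32 values
        LongColumnVector vector = new LongColumnVector(parquetPage.length);
        for (int i = 0; i < parquetPage.length; i++) {
            vector.putInt(i, parquetPage[i]);
        }
        System.out.println(vector.getLong(2)); // 2147483647, stored as a long
    }
}
```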
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15035
We definitely shouldn't change `SpecificMutableRow` to do this upcast;
otherwise we might introduce subtle bugs with type mismatches in the future.
cc @sameeragarwal to see if there is a better p
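The "subtle bugs" concern can be made concrete with a toy sketch (hypothetical class, not Spark's `SpecificMutableRow`): a strictly typed row fails fast when a caller uses the wrong-width setter, whereas a setter that silently widens would absorb that mismatch and hide the bug.

```java
// Hedged sketch of the concern: if setInt on a long-typed field silently
// widened instead of failing, a caller using the wrong setter would never
// find out. The strict variant below surfaces the mismatch immediately.
public class MutableRowSketch {
    static final class StrictRow {
        private final long[] longs = new long[1];
        // Strict behavior: calling the int setter on a long field is an error.
        void setInt(int ordinal, int value) {
            throw new UnsupportedOperationException(
                "field " + ordinal + " is a long, not an int");
        }
        void setLong(int ordinal, long value) { longs[ordinal] = value; }
        long getLong(int ordinal) { return longs[ordinal]; }
    }

    public static void main(String[] args) {
        StrictRow row = new StrictRow();
        row.setLong(0, 42L);       // correct, width-matched call
        try {
            row.setInt(0, 42);     // wrong setter: fails fast, not widened
        } catch (UnsupportedOperationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```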
Github user wgtmac commented on the issue:
https://github.com/apache/spark/pull/15035
@HyukjinKwon Yup, that makes sense. Do you have any idea where the best
place to fix this is?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15035
Hm.. are you sure this is a problem in all data sources? IIUC, JSON and CSV
kind of allow permissive upcasting whereas ORC and Parquet do not - so this
would be rather ORC- and Parquet-specific
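The distinction drawn here can be illustrated with a small sketch: a text source (JSON/CSV-style) stores values as characters, so the requested schema decides the width at parse time, while a binary columnar source has already committed to a fixed physical width and needs an explicit widening step in the reader. (The `ByteBuffer` below stands in for a Parquet INT32 page; this is an illustration, not either reader's actual code.)

```java
import java.nio.ByteBuffer;

public class TextVsBinarySketch {
    public static void main(String[] args) {
        // Text source: the token "42" can be parsed at whatever width the
        // requested schema asks for, so upcasting is naturally permissive.
        String token = "42";
        long asLong = Long.parseLong(token);   // LongType schema: fine
        int asInt = Integer.parseInt(token);   // IntegerType schema: also fine

        // Binary columnar source: these four bytes are physically an INT32;
        // reading them under a LongType schema requires an explicit widening
        // step somewhere in the reader.
        ByteBuffer page = ByteBuffer.allocate(4).putInt(42);
        page.flip();
        int physical = page.getInt();
        long widened = physical;               // the conversion in question

        System.out.println(asLong == widened && asInt == physical);
    }
}
```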
Github user wgtmac commented on the issue:
https://github.com/apache/spark/pull/15035
@JoshRosen yes, it risks masking overflows. This conversion happens when the
user-provided schema or Hive metastore schema has Long but the Parquet files
have Int as the schema. We cannot avoid this r
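To make the overflow concern concrete: the direction discussed here (physical Int in the file, Long in the schema) is a lossless widening, while the reverse direction is where a silent conversion would hide data corruption. A minimal sketch:

```java
public class OverflowMaskSketch {
    public static void main(String[] args) {
        // Widening, the direction in this PR, is lossless:
        int fileValue = Integer.MAX_VALUE;  // physical INT32 in the file
        long widened = fileValue;           // Long in the user/metastore schema

        // The opposite direction is where a silent conversion masks bugs:
        long big = 5_000_000_000L;          // does not fit in 32 bits
        int truncated = (int) big;          // silently wraps to 705032704

        System.out.println(widened + " " + truncated);
    }
}
```

This is why reviewers want the widening done explicitly in the reader for the Int-file/Long-schema case only, rather than as a blanket conversion that could also swallow narrowing mistakes.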
Github user wgtmac commented on the issue:
https://github.com/apache/spark/pull/15035
@HyukjinKwon This is not Parquet-specific; it applies to other data sources
as well.
1. Change the reading path for Parquet: it does not solve the problem. Some
queries need to read all Parquet f
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15035
Do you mind if I ask whether this works with the vectorized Parquet reader
too? I know the normal Parquet reader uses `SpecificMutableRow` but IIRC, the
Parquet vectorized reader relies on `ColumnarBatch` w
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15035
Shouldn't we change the reading path for Parquet rather than changing the
target row, to avoid per-record type dispatch? Also, it seems a Parquet-specific
issue, but I wonder about making changes in row
Github user JoshRosen commented on the issue:
https://github.com/apache/spark/pull/15035
+1 on adding a test, otherwise this risks regressing in future
refactorings. Also, I'm not sure whether `SpecificMutableRow` itself is
necessarily the right place to be performing this type widening
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/15035
Would it maybe make sense to add an automated test for this?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15035
Can one of the admins verify this patch?