[GitHub] spark issue #15035: [SPARK-17477]: SparkSQL cannot handle schema evolution f...
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15035

Just confirmed that this also doesn't work with the vectorized reader. What I did is as follows:

1. Created a flat Hive table with schema "name: String, id: Long", but the Parquet file, which contains 100 rows, uses "name: String, id: Int".
2. Ran the query "select * from table" and tried to show the result. It works fine with `DataFrame.count()` and `DataFrame.printSchema()`, but got the following exception on `show()`:

```
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
	... 48 elided
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.
```
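The NPE at `OnHeapColumnVector.getLong` follows from how that vector is laid out: it keeps one primitive array per supported type and only allocates the array matching the column's declared type, so generated code that calls `getLong` on a vector allocated for an int column dereferences a null `long[]`. A minimal, self-contained sketch of that failure mode (a toy class with hypothetical names, not Spark's actual implementation):

```scala
// Toy column vector mimicking OnHeapColumnVector's per-type buffers:
// only the buffer for the declared type is allocated, so a getter for
// the wrong type dereferences a null array.
class ToyColumnVector(capacity: Int, isLongType: Boolean) {
  private val intData: Array[Int] =
    if (!isLongType) new Array[Int](capacity) else null
  private val longData: Array[Long] =
    if (isLongType) new Array[Long](capacity) else null

  def putInt(rowId: Int, value: Int): Unit = intData(rowId) = value
  def getInt(rowId: Int): Int = intData(rowId)
  def getLong(rowId: Int): Long = longData(rowId) // NPE on an int-typed vector
}

// The file schema says Int, but the table schema says Long, so the
// generated iterator reads the column via getLong and blows up.
def demo(): (Int, Boolean) = {
  val vec = new ToyColumnVector(4, isLongType = false)
  vec.putInt(0, 42)
  val npe =
    try { vec.getLong(0); false }
    catch { case _: NullPointerException => true }
  (vec.getInt(0), npe)
}
```

Reading the same slot through the correctly-typed getter still works, which is why `count()` and `printSchema()` (which never materialize the long column) succeed while `show()` fails.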
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/15035

For our vectorized parquet reader, we try to take care of these type conversions here: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L360-L369 although that wouldn't work with nested schemas.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
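For flat columns, the gist of that reader-side conversion is a widening copy at decode time: when the Parquet file stores INT32 but the catalyst type is LongType, each decoded int is upcast as it is written into the long-typed column buffer. A simplified sketch of that widening step (hypothetical helper name and signature; Spark's real code works against column vectors, not raw arrays):

```scala
// Widen a batch of INT32 values decoded from a Parquet page into a
// long-typed column buffer, as a reader-side upcast would do.
def widenInt32ToInt64(src: Array[Int], dst: Array[Long], rowId: Int, num: Int): Unit = {
  var i = 0
  while (i < num) { // while-loop style, matching Spark's hot decoding paths
    dst(rowId + i) = src(i).toLong // lossless widening, no overflow possible
    i += 1
  }
}
```

Doing the conversion here keeps the per-record type dispatch out of the row classes: the file's physical type is fixed per column chunk, so the reader picks the widening path once per page rather than branching on every `set` call.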
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15035

We definitely shouldn't change SpecificMutableRow to do this upcast; otherwise we might introduce subtle bugs with type mismatches in the future. cc @sameeragarwal to see if there is a better place to do this -- I think doing this in Parquet itself is pretty reasonable?
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15035

@HyukjinKwon Yup, that makes sense. Do you have any idea where the best place to fix this is?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15035

Hm.. are you sure this is a problem in all data sources? IIUC, JSON and CSV kind of allow permissive upcasting whereas ORC and Parquet do not - so this would rather be an ORC- and Parquet-specific problem. Could you confirm whether this happens in other data sources, please? Also, I believe this will generally degrade performance in `SpecificMutableRow`. I wonder if it is worth doing this to support this case.
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15035

@JoshRosen Yes, it may mask overflow. This conversion happens when the user-provided schema or the Hive metastore schema has Long but the Parquet files have Int as the schema. We cannot avoid this risk in this case.
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15035

@HyukjinKwon This is not Parquet-specific; it applies to other data sources as well.

1. Change the reading path for Parquet: this does not solve the problem, because some queries need to read all the Parquet files.
2. Make changes in the row: yes, I have to change it per row, because some Parquet files have int while others have long. We can't know in advance which row is good and which is problematic.
3. Vectorized Parquet reader: this is a good catch. I haven't considered it yet.

It would be great if you can come up with other good ideas and continue to work on it. Feedback and discussion are welcome. Thanks!
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15035

Do you mind if I ask whether this works with the vectorized Parquet reader too? I know the normal Parquet reader uses `SpecificMutableRow`, but IIRC the vectorized Parquet reader relies on `ColumnarBatch`, which does not use `SpecificMutableRow`.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15035

Shouldn't we change the reading path for Parquet rather than changing the target row, to avoid per-record type dispatch? Also, it seems to be a Parquet-specific issue, and I wonder whether making changes in the row is a good approach. I remember my PR to support upcasting in the schema for Parquet, https://github.com/apache/spark/pull/14215, which I decided to close in favor of a better approach. I haven't taken a close look yet, but I will and leave some comments.
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/15035

+1 on adding a test; otherwise this risks regressing in future refactorings. Also, I'm not sure whether `SpecificMutableRow` itself is necessarily the right place to be performing this type widening, since that could introduce or mask bugs in other uses of this API. It also seems slightly dodgy to let users call `setInt` with a long, since that could cause or mask overflow bugs.
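The overflow concern is concrete: widening Int to Long is always lossless, but a `setInt` that quietly accepts a long value implies a narrowing cast, which wraps around instead of failing. A quick illustration of the asymmetry (plain Scala, independent of `SpecificMutableRow`):

```scala
// Widening Int -> Long preserves every value; narrowing Long -> Int
// keeps only the low 32 bits, so out-of-range values silently wrap.
def widen(i: Int): Long = i.toLong
def narrow(l: Long): Int = l.toInt

// 2147483648 is one past Int.MaxValue; narrowing wraps it to Int.MinValue,
// which is exactly the kind of masked overflow a permissive setInt invites.
def wrapped: Int = narrow(Int.MaxValue.toLong + 1L)
```

This is why the upcast is only safe in one direction (file Int, table Long); the reverse mapping cannot be done per-row without either wrapping silently or raising at read time.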
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15035

Would it maybe make sense to add an automated test for this?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15035

Can one of the admins verify this patch?