[ 
https://issues.apache.org/jira/browse/SPARK-31116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31116:
----------------------------------
    Target Version/s: 3.0.0

> ParquetRowConverter does not follow case sensitivity
> ----------------------------------------------------
>
>                 Key: SPARK-31116
>                 URL: https://issues.apache.org/jira/browse/SPARK-31116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Tae-kyeom, Kim
>            Priority: Blocker
>
> After upgrading the Spark version to 3.0.0-SNAPSHOT, selecting parquet columns
> throws an exception when the names match only case-insensitively, even though
> spark.sql.caseSensitive is set to false. In other words, reading parquet with a
> case-ignoring schema (the columns in the parquet file and the Catalyst schema
> are the same only when compared case-insensitively) fails.
>  
> To reproduce the error, executing the following code throws
> java.lang.IllegalArgumentException:
>  
> {code:java}
> import org.apache.spark.sql.types._
> val path = "/some/temp/path"
> spark
>   .range(1L)
>   .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
>   .write.parquet(path)
> val caseInsensitiveSchema = new StructType()
>   .add(
>     "StructColumn",
>     new StructType()
>       .add("LowerCase", LongType)
>       .add("camelcase", LongType))
> spark.read.schema(caseInsensitiveSchema).parquet(path).show(){code}
> Then we get the following error:
> {code:java}
> 23:57:09.077 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 215.0 (TID 366)
> 23:57:09.077 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 215.0 (TID 366)
> java.lang.IllegalArgumentException: lowercase does not exist. Available: LowerCase, camelcase
>   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:306)
>   at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:305)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.$anonfun$fieldConverters$1(ParquetRowConverter.scala:182)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:181)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:351)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.$anonfun$fieldConverters$1(ParquetRowConverter.scala:185)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:181)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.<init>(ParquetRecordMaterializer.scala:43)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.prepareForRead(ParquetReadSupport.scala:130)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:341)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
>   at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1229)
>   at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1229)
>   at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2144)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:460)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:463)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> I think that since 3.0.0,
> `org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter` no
> longer has an equal number of fields between parquetRequestedSchema and
> catalystRequestedSchema ([https://github.com/apache/spark/pull/22880]), so we
> should take case sensitivity into account in ParquetRowConverter or some
> related classes.
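>
> For illustration, here is a minimal sketch of how a field lookup that honors
> spark.sql.caseSensitive=false could behave, using the schema from the
> reproduction above. This is not the actual patch; the helper name and its
> placement are hypothetical.
> {code:scala}
> import org.apache.spark.sql.types.{LongType, StructType}
>
> // Hypothetical helper: resolve a Parquet field name against the Catalyst
> // StructType, falling back to a case-insensitive match when the analyzer
> // runs with spark.sql.caseSensitive=false.
> def fieldIndexRespectingCase(schema: StructType, name: String, caseSensitive: Boolean): Int = {
>   if (caseSensitive) {
>     schema.fieldIndex(name) // exact match; throws IllegalArgumentException otherwise
>   } else {
>     val idx = schema.fields.indexWhere(_.name.equalsIgnoreCase(name))
>     if (idx >= 0) idx
>     else throw new IllegalArgumentException(
>       s"$name does not exist. Available: ${schema.fieldNames.mkString(", ")}")
>   }
> }
>
> // The Parquet field "lowercase" now resolves against the Catalyst field "LowerCase".
> val struct = new StructType()
>   .add("LowerCase", LongType)
>   .add("camelcase", LongType)
> assert(fieldIndexRespectingCase(struct, "lowercase", caseSensitive = false) == 0)
> {code}
> A real fix would presumably also need to handle fields that differ only in
> case (ambiguous matches) when case-insensitive resolution is enabled.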



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
