[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26188:
-----------------------------------------
    Description: 
My team uses Spark to partition Parquet output and write it to Amazon S3. We
typically use 256 partitions, from 00 to ff.

We've observed that Spark 2.3.2 and earlier read the partition values as
strings by default. In Spark 2.4.0 and later, however, the type of each
partition value is inferred by default, so values such as 00 become 0 and 4d
becomes 4.0.
Here is a log sample of this behavior from one of our jobs:
2.4.0:
{code:java}
18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
range: 0-662, partition values: [0]
18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
range: 0-662, partition values: [ef]
18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
range: 0-662, partition values: [4a]
18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
range: 0-662, partition values: [74]
18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
range: 0-662, partition values: [f5]
18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
range: 0-662, partition values: [50]
18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
range: 0-662, partition values: [70]
18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
range: 0-662, partition values: [b9]
18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
range: 0-662, partition values: [d2]
18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=51/part-00003-hashredacted.parquet, 
range: 0-662, partition values: [51]
18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
range: 0-662, partition values: [84]
18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
range: 0-662, partition values: [b5]
18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
range: 0-662, partition values: [88]
18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
range: 0-662, partition values: [4.0]
18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
range: 0-662, partition values: [ac]
18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
range: 0-662, partition values: [24]
18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
range: 0-662, partition values: [fd]
18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
range: 0-662, partition values: [52]
18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
range: 0-662, partition values: [ab]
18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
range: 0-662, partition values: [f8]
18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
range: 0-662, partition values: [7a]
18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=ba/part-00020-hashredacted.parquet, 
range: 0-662, partition values: [ba]
18/11/27 14:02:51 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=2d/part-00085-hashredacted.parquet, 
range: 0-662, partition values: [2.0]
18/11/27 14:02:52 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=03/part-00099-hashredacted.parquet, 
range: 0-662, partition values: [3]
18/11/27 14:02:53 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=57/part-00196-hashredacted.parquet, 
range: 0-662, partition values: [57]
18/11/27 14:02:54 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=81/part-00122-hashredacted.parquet, 
range: 0-662, partition values: [81]
18/11/27 14:02:55 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=6d/part-00167-hashredacted.parquet, 
range: 0-662, partition values: [6.0]
18/11/27 14:02:56 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=36/part-00154-hashredacted.parquet, 
range: 0-662, partition values: [36]
18/11/27 14:02:57 INFO FileScanRDD: Reading File path: 
s3a://bucketnamereadacted/ddgirard/suffix=4b/part-00093-hashredacted.parquet, 
range: 0-662, partition values: [4b]{code}




2.3.2:
{code:java}
18/11/27 14:09:00 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=60/part-00082-hashredacted.parquet,
 range: 0-662, partition values: [60]
18/11/27 14:09:01 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=00/part-00061-hashredacted.parquet,
 range: 0-662, partition values: [00]
18/11/27 14:09:02 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=ef/part-00034-hashredacted.parquet,
 range: 0-662, partition values: [ef]
18/11/27 14:09:02 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=4a/part-00151-hashredacted.parquet,
 range: 0-662, partition values: [4a]
18/11/27 14:09:03 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=74/part-00180-hashredacted.parquet,
 range: 0-662, partition values: [74]
18/11/27 14:09:04 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=f5/part-00156-hashredacted.parquet,
 range: 0-662, partition values: [f5]
18/11/27 14:09:04 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=50/part-00195-hashredacted.parquet,
 range: 0-662, partition values: [50]
18/11/27 14:09:05 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=70/part-00054-hashredacted.parquet,
 range: 0-662, partition values: [70]
18/11/27 14:09:05 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=b9/part-00012-hashredacted.parquet,
 range: 0-662, partition values: [b9]
18/11/27 14:09:06 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=d2/part-00016-hashredacted.parquet,
 range: 0-662, partition values: [d2]
18/11/27 14:09:06 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=51/part-00003-hashredacted.parquet,
 range: 0-662, partition values: [51]
18/11/27 14:09:07 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=84/part-00135-hashredacted.parquet,
 range: 0-662, partition values: [84]
18/11/27 14:09:08 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=b5/part-00190-hashredacted.parquet,
 range: 0-662, partition values: [b5]
18/11/27 14:09:08 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=88/part-00143-hashredacted.parquet,
 range: 0-662, partition values: [88]
18/11/27 14:09:09 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=4d/part-00120-hashredacted.parquet,
 range: 0-662, partition values: [4d]
18/11/27 14:09:09 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=ac/part-00119-hashredacted.parquet,
 range: 0-662, partition values: [ac]
18/11/27 14:09:10 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=24/part-00139-hashredacted.parquet,
 range: 0-662, partition values: [24]
18/11/27 14:09:11 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=fd/part-00167-hashredacted.parquet,
 range: 0-662, partition values: [fd]
18/11/27 14:09:11 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=52/part-00033-hashredacted.parquet,
 range: 0-662, partition values: [52]
18/11/27 14:09:12 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=ab/part-00083-hashredacted.parquet,
 range: 0-662, partition values: [ab]
18/11/27 14:09:12 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=f8/part-00018-hashredacted.parquet,
 range: 0-662, partition values: [f8]
18/11/27 14:09:13 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=7a/part-00093-hashredacted.parquet,
 range: 0-662, partition values: [7a]
18/11/27 14:09:13 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=ba/part-00020-hashredacted.parquet,
 range: 0-662, partition values: [ba]
18/11/27 14:09:14 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=2d/part-00085-hashredacted.parquet,
 range: 0-662, partition values: [2d]
18/11/27 14:09:15 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=03/part-00099-hashredacted.parquet,
 range: 0-662, partition values: [03]
18/11/27 14:09:15 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=57/part-00196-hashredacted.parquet,
 range: 0-662, partition values: [57]
18/11/27 14:09:16 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=81/part-00122-hashredacted.parquet,
 range: 0-662, partition values: [81]
18/11/27 14:09:17 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=6d/part-00167-hashredacted.parquet,
 range: 0-662, partition values: [6d]
18/11/27 14:09:17 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=36/part-00154-hashredacted.parquet,
 range: 0-662, partition values: [36]
18/11/27 14:09:18 INFO FileScanRDD: Reading File path: 
s3a://cogo-emr-research-scratch/ddgirard/suffix=4b/part-00093-hashredacted.parquet,
 range: 0-662, partition values: [4b]
{code}
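
The 4d, 2d and 6d values in the 2.4.0 log are consistent with an inference
step that falls back to Java's double parsing, which accepts a trailing d as a
type suffix. The Scala sketch below illustrates that idea only: inferValue and
PartitionValueSketch are hypothetical names, and the int/long/double fallback
order is an assumption rather than a copy of Spark's PartitioningUtils.

```scala
import scala.util.Try

// Hypothetical helper, not Spark's actual code: tries progressively wider
// numeric types and falls back to the raw string if none of them parse.
object PartitionValueSketch {
  def inferValue(raw: String): Any =
    Try[Any](raw.toInt)
      .orElse(Try(raw.toLong))
      // Double.parseDouble accepts a trailing 'd' or 'f' type suffix,
      // which is why "4d" parses as 4.0 while "4a" does not parse at all.
      .orElse(Try(java.lang.Double.parseDouble(raw)))
      .getOrElse(raw)

  def main(args: Array[String]): Unit = {
    // Directory values taken from the log samples above.
    Seq("00", "03", "4d", "4a", "ef").foreach { raw =>
      println(s"$raw -> ${inferValue(raw)}")
    }
  }
}
```

Run against these values, the sketch prints 00 -> 0, 03 -> 3, 4d -> 4.0,
4a -> 4a and ef -> ef, matching the pattern in the 2.4.0 log.
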
After some investigation, we've isolated the issue to the inferPartitioning
method in PartitioningAwareFileIndex.scala:
[https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L132-L136]

In 2.3.2, this method hard-codes typeInference = false:
{code:java}
val spec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
However, in version 2.4.0, the hard-coded typeInference value has been
replaced with a config flag:
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L129-L133]
{code:java}
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference =
    sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
  basePaths = basePaths,
  timeZoneId = timeZoneId){code}
This conf's default value is true:
{code:java}
val PARTITION_COLUMN_TYPE_INFERENCE =
  buildConf("spark.sql.sources.partitionColumnTypeInference.enabled")
    .doc("When true, automatically infer the data types for partitioned columns.")
    .booleanConf
    .createWithDefault(true){code}
[https://github.com/apache/spark/blob/075447b3965489ffba4e6afb2b120880bc307505/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L636-L640]
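
Whatever the default ends up being, one possible workaround is to set this conf
explicitly. The sketch below assumes a job launched through spark-submit with
Spark 2.4.0 on the classpath. The object name, app name and S3 path are
placeholders; only the conf key comes from the SQLConf definition above. It is
untested here and meant as a starting point, not a confirmed fix.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical job, shown only to illustrate where the conf would be set.
object ReadPartitionsAsStrings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-type-inference-off") // placeholder app name
      // Conf key quoted from SQLConf above; "false" turns inference off.
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .getOrCreate()

    // Placeholder path standing in for the redacted bucket in the logs.
    val df = spark.read.parquet("s3a://some-bucket/ddgirard/")

    // With inference off, the suffix column is expected to be a string.
    df.printSchema()

    spark.stop()
  }
}
```

The same key can presumably also be passed on the command line with
--conf, which avoids touching job code.
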
  

Would a bug report be appropriate here? Changing the conf's default value to
false would preserve backwards compatibility.

 
  


> Spark 2.4.0 behavior breaks backwards compatibility
> ---------------------------------------------------
>
>                 Key: SPARK-26188
>                 URL: https://issues.apache.org/jira/browse/SPARK-26188
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Damien Doucet-Girard
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
