[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2021-01-18 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267408#comment-17267408
 ] 

Steve Loughran commented on SPARK-32582:


Returning to this. 

The incremental listStatusIterator() call returns results incrementally rather 
than as one complete listing. Specifically, it pages results back on HDFS, 
WebHDFS, S3A and, soon, ABFS. 

This means that if you use it to scan a directory, you can stop as soon as you 
have what you need.

When you stop, please check whether the iterator is Closeable and, if it is, 
call close() on it. This hints to those connectors which prefetch pages that 
they should stop prefetching.
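A minimal sketch of the "close it if it is Closeable" pattern, assuming the listing is modelled as a plain Scala Iterator. Hadoop's listStatusIterator() actually returns a RemoteIterator, so real code would need adapting; `scanThenClose` and `TrackedIterator` are hypothetical names used only for illustration.

```scala
import java.io.Closeable

object CloseHint {
  // Runs `body` over the iterator, then closes the iterator if it is Closeable.
  def scanThenClose[T, R](it: Iterator[T])(body: Iterator[T] => R): R =
    try body(it)
    finally {
      it match {
        case c: Closeable => c.close() // hint to the connector: stop any prefetching
        case _            => ()        // plain iterator: nothing to release
      }
    }

  // Hypothetical stand-in for a connector iterator; it only records that close() was called.
  final class TrackedIterator[T](underlying: Iterator[T]) extends Iterator[T] with Closeable {
    @volatile var closed = false
    def hasNext: Boolean = !closed && underlying.hasNext
    def next(): T = underlying.next()
    def close(): Unit = { closed = true }
  }
}
```

For example, `CloseHint.scanThenClose(it)(_.find(_.endsWith(".orc")))` stops at the first matching entry and still closes `it`, even if the scan throws.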

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Jarred Li
> Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table, 
> but only one of the files is used to read the schema information. Listing 
> every file hurts performance when the number of partitions is large.
>  
> See the code in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88:
> all the files in the table are passed in, but only one file's schema is 
> used to infer the schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-11-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231440#comment-17231440
 ] 

Steve Loughran commented on SPARK-32582:


If you use listStatusIterator(), then those clients which do paged downloads 
(HDFS, WebHDFS, S3A on Hadoop 3.3.1+) will not do a full listing of the 
directory tree. You'll get results as soon as the first page of data arrives. 
This will be faster and more efficient (i.e. cloud stores won't bill you for so 
many LIST calls).
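To illustrate why paging saves LIST calls, here is a small self-contained model. It does not use the Hadoop API: `FakeStore`, `fetchPage`, `listAll` and `listIncrementally` are hypothetical names, and each `fetchPage` call stands in for one LIST request against a store.

```scala
object PagedListingSketch {
  // Hypothetical store: `fetchPage` models one LIST request and is counted.
  final class FakeStore(files: Vector[String], pageSize: Int) {
    var listCalls = 0

    def fetchPage(offset: Int): Vector[String] = {
      listCalls += 1
      files.slice(offset, offset + pageSize)
    }

    // Eager listing: fetches every page before returning anything.
    def listAll(): Vector[String] = {
      val numPages = math.max(1, (files.length + pageSize - 1) / pageSize)
      (0 until numPages).toVector.flatMap(p => fetchPage(p * pageSize))
    }

    // Incremental listing: fetches a page only when the caller asks for more entries.
    def listIncrementally(): Iterator[String] =
      Iterator.from(0, pageSize)
        .map(offset => fetchPage(offset))
        .takeWhile(_.nonEmpty)
        .flatMap(_.iterator)
  }
}
```

In this model, a caller that only needs the first entry triggers a single page fetch with `listIncrementally()`, whereas `listAll()` pays for every page up front.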




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-16 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178445#comment-17178445
 ] 

Jarred Li commented on SPARK-32582:
---

??I am not sure it would be helpful since there is no API in Hadoop to list 
partial files in a folder.??



We don't need to list all the partitions in a table. "Sample" here means that 
we sample some of the partitions, not all of them. Within each sampled 
partition, we can list all the files in that folder. 

 




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-11 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175981#comment-17175981
 ] 

Lantao Jin commented on SPARK-32582:


{quote}
 I remember I investigated this issue and the Hadoop API itself lists in 
batches. A streaming way of listing isn't possible.
{quote}

Yes, we could list the file statuses of a single partition if it is a 
partitioned table. For a non-partitioned table, it would still list all files. 
We assume that too many files in a non-partitioned table is bad data warehouse 
design.

{quote}
We can add one more mode "INFER_WITH_SAMPLE".
{quote}

I am not sure it would be helpful since there is no API in Hadoop to list 
partial files in a folder.




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-11 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175428#comment-17175428
 ] 

Jarred Li commented on SPARK-32582:
---

I think this is a limitation of ORC schema inference. "fileIndex.listFiles" 
lists all the files in the table, while "" uses only one file to get the file 
schema ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L96]).

 

If ORC does not support mergeSchema, we should add one more method that reads 
the schema from a single file, so that it does not need to list all files. Of 
course, this is not a long-term solution. 

 

For other file formats, for example Parquet, it is time-consuming to iterate 
over all the files for schema inference. I am wondering whether we should add a 
parameter to sample the files for schema inference to improve performance. We 
can add one more schema inference mode for HIVE_CASE_SENSITIVE_INFERENCE to 
support sampling. Currently, there are 3 modes: 

INFER_AND_SAVE, INFER_ONLY, NEVER_INFER

We can add one more mode, "INFER_WITH_SAMPLE". By controlling the sample 
percentage, we can control how many files are read for schema inference. 

Comments on this solution are welcome.

 

 
{code:java}
// Every file in every partition is listed before the schema is inferred.
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}
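
The sampling idea proposed above could look roughly like the following. This is only a sketch: "INFER_WITH_SAMPLE" does not exist in Spark, and `sampleForInference`, `fraction` and `seed` are hypothetical names. It keeps a fixed fraction of the inputs (files or partitions), and always at least one, so that only those would need to be listed or read.

```scala
import scala.util.Random

object SampleSketch {
  // Keeps roughly `fraction` of the items (at least one), chosen reproducibly from `seed`.
  def sampleForInference[T](items: Seq[T], fraction: Double, seed: Long = 42L): Seq[T] = {
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    if (items.isEmpty) Seq.empty
    else {
      val n = math.max(1, math.ceil(items.size * fraction).toInt)
      new Random(seed).shuffle(items.toVector).take(n)
    }
  }
}
```

A fixed seed keeps the chosen sample stable between runs, which makes the inferred schema reproducible; whether that is desirable is a design choice left open here.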
 

 




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175273#comment-17175273
 ] 

Hyukjin Kwon commented on SPARK-32582:
--

[~leejianwei] do you mean we shouldn't list files at all? How would you know 
which files the path contains? I remember I investigated this issue and the 
Hadoop API itself lists in batches. A streaming way of listing isn't possible.




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175203#comment-17175203
 ] 

Lantao Jin commented on SPARK-32582:


Maybe we could offer a new interface that breaks out after one iteration when 
mergeSchema is false. I am not sure.
{code}
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      f: (FileIndex) => Seq[FileStatus]): Option[StructType]
{code}

Do you already have a fix? A PR is welcome.




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175197#comment-17175197
 ] 

Lantao Jin commented on SPARK-32582:


I see. The implementation of the {{inferSchema}} method depends on the 
underlying file format. Even for ORC, we still need all the files, since the 
given ORC files can have different schemas and we want to get a merged schema.




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175165#comment-17175165
 ] 

Jarred Li commented on SPARK-32582:
---

The performance problem I mentioned here is not in reading the files, but in 
"LIST"ing them. For example, if a table has 1000 partitions, the files in all 
1000 partitions are listed first, yet only one file is read for schema 
inference. The "LIST" operation is time-consuming, especially on object stores 
such as S3.

 

See the file-listing code: 
[https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300]

 

 




[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175143#comment-17175143
 ] 

Lantao Jin commented on SPARK-32582:


{code}
files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst
{code}
{{collectFirst()}} breaks out as soon as its iterator finds a matching value.
{code}
  def collectFirst[B](pf: PartialFunction[A, B]): Option[B] = {
    for (x <- self.toIterator) { // make sure to use an iterator or `seq`
      if (pf isDefinedAt x)
        return Some(pf(x))
    }
    None
  }
{code}
So in most cases, it reads only one file.
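
A small self-contained demonstration of that short-circuiting, using a counter in place of real file reads. `CountingReader` and its `readSchema` are hypothetical stand-ins, not Spark's ORC reader; the strict variant is included only for contrast.

```scala
object CollectFirstSketch {
  // Counts how many "files" are read; schemas are looked up from an in-memory map.
  final class CountingReader(schemas: Map[String, Option[String]]) {
    var reads = 0
    def readSchema(path: String): Option[String] = {
      reads += 1
      schemas.getOrElse(path, None)
    }
  }

  // Lazy: the iterator maps one file at a time and stops at the first Some.
  def firstSchemaLazy(files: Seq[String], reader: CountingReader): Option[String] =
    files.iterator
      .map(path => reader.readSchema(path))
      .collectFirst { case Some(schema) => schema }

  // Strict, for contrast: mapping the Seq itself reads every file before collectFirst runs.
  def firstSchemaStrict(files: Seq[String], reader: CountingReader): Option[String] =
    files
      .map(path => reader.readSchema(path))
      .collectFirst { case Some(schema) => schema }
}
```

Note that this only limits how many files are read; the `files` sequence passed in has already been listed in full.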
