[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267408#comment-17267408 ]

Steve Loughran commented on SPARK-32582:

Returning to this. The incremental listStatusIterator() lister returns results incrementally rather than as one complete listing. Specifically, it pages results back in HDFS, WebHDFS, S3A and, soon, ABFS. This means that if you use it to scan the directory, you can stop as soon as you have what you need.

When you stop, check whether the iterator is Closeable and, if it is, call close() on it. This gives a hint to those connectors which prefetch pages that they should stop.

> Spark SQL Infer Schema Performance
> ----------------------------------
>
>                 Key: SPARK-32582
>                 URL: https://issues.apache.org/jira/browse/SPARK-32582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Jarred Li
>            Priority: Major
>
> When schema inference is enabled, Spark tries to list all the files in the table,
> but only one of the files is used to read the schema information. Performance
> suffers from listing every file in the table when the number of partitions is
> large.
>
> See the code in
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]":
> all the files in the table are passed in, but only one file's schema
> is used to infer the schema.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
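The pattern described in this comment (scan with listStatusIterator(), stop early, then close the iterator if it is Closeable) might look roughly like the sketch below. It assumes the Hadoop client library is on the classpath; the object and method names (FirstFileLister, firstFile) are made up for illustration and are not part of Spark or Hadoop.

```scala
import java.io.Closeable
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, RemoteIterator}

object FirstFileLister {
  /** Returns the first regular file directly under `dir`, stopping the listing early. */
  def firstFile(fs: FileSystem, dir: Path): Option[FileStatus] = {
    // Connectors that page their listings fetch results incrementally here.
    val it: RemoteIterator[FileStatus] = fs.listStatusIterator(dir)
    try {
      var found: Option[FileStatus] = None
      while (found.isEmpty && it.hasNext()) {
        val status = it.next()
        if (status.isFile()) found = Some(status)
      }
      found
    } finally {
      // Hint to connectors that prefetch pages that no more results are needed.
      it match {
        case c: Closeable => c.close()
        case _ => ()
      }
    }
  }
}
```

The close() call sits in a finally block so the hint is sent whether the scan stops early, runs to the end, or fails with an exception.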
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231440#comment-17231440 ]

Steve Loughran commented on SPARK-32582:

If you use listStatusIterator(), then those clients which do paged downloads (HDFS, WebHDFS, S3A on Hadoop 3.3.1+) will not do a full listing of the directory tree. You'll get results as soon as the first page of data arrives.

This will be faster and more efficient (i.e. cloud stores won't bill you for so many LIST calls).
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178445#comment-17178445 ]

Jarred Li commented on SPARK-32582:

??I am not sure it would be helpful since there is no API in Hadoop to list partial files in a folder.??

We don't need to list all the partitions in a table. "Sample" here means that we sample some of the partitions, not all of them. Within each sampled partition, we can still list all the files in that folder.
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175981#comment-17175981 ]

Lantao Jin commented on SPARK-32582:

{quote}
I remember investigating this issue, and the Hadoop API itself lists in batch. A streaming way of listing isn't possible.
{quote}
Yes, we could list the status of a single partition if it is a partitioned table. For a non-partitioned table, it still lists all files. We assume that having too many files in a non-partitioned table is bad design in a data warehouse.

{quote}
We can add one more mode "INFER_WITH_SAMPLE".
{quote}
I am not sure it would be helpful since there is no API in Hadoop to list partial files in a folder.
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175428#comment-17175428 ]

Jarred Li commented on SPARK-32582:

I think this is a limitation of ORC schema inference. "fileIndex.listFiles" lists all the files in the table, while "" only uses one file to get the file schema ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L96]).

If ORC does not support mergeSchema, we should add one more method that reads the schema from a single file, so that it does not need to list all the files.

Of course, this is not a long-term solution. For other file formats, for example Parquet, it is time-consuming to iterate over all the files for schema inference. I am wondering whether we should add a parameter to sample the files for schema inference to improve performance.

We can add one more schema inference mode to HIVE_CASE_SENSITIVE_INFERENCE to support sampling. Currently, there are 3 modes: INFER_AND_SAVE, INFER_ONLY, NEVER_INFER. We can add one more mode, "INFER_WITH_SAMPLE". By controlling the sample percentage, we can control how many files are read for schema inference.

Your comments on this solution are welcome.

{code:java}
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}
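To make the "INFER_WITH_SAMPLE" idea above more concrete, here is a minimal sketch of choosing a fraction of partitions to list. Everything in it is an assumption for illustration: PartitionSampler, samplePartitions, the fraction and seed parameters, and the rule of always keeping at least one partition are hypothetical and do not exist in Spark.

```scala
import scala.util.Random

object PartitionSampler {
  /**
   * Picks roughly `fraction` of the partitions, and at least one when any exist.
   * Hypothetical helper for illustration only; not part of Spark.
   */
  def samplePartitions(partitions: Seq[String], fraction: Double, seed: Long): Seq[String] = {
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    if (partitions.isEmpty) Seq.empty
    else {
      // Round up so that a small fraction still selects one partition.
      val n = math.max(1, math.ceil(partitions.size * fraction).toInt)
      // A fixed seed keeps the selection reproducible between runs.
      new Random(seed).shuffle(partitions).take(n)
    }
  }
}
```

Only the selected partitions would then be listed in full, so the number of LIST calls scales with the sample size rather than with the total partition count.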
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175273#comment-17175273 ]

Hyukjin Kwon commented on SPARK-32582:

[~leejianwei] do you mean we shouldn't list files at all? How do you know which files the path contains?

I remember investigating this issue, and the Hadoop API itself lists in batch. A streaming way of listing isn't possible.
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175203#comment-17175203 ]

Lantao Jin commented on SPARK-32582:

Maybe we could offer a new interface that breaks out after one iteration when mergeSchema is false. I am not sure.

{code}
def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    f: (FileIndex) => Seq[FileStatus]): Option[StructType]
{code}

Do you already have a fix? A PR is welcome.
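One way to read the suggestion above is that the listing should be deferred, so that inference can stop before every partition has been listed. The sketch below illustrates that with plain Scala stand-ins. FileEntry, the partitionListers parameter, and the use of strings as schemas are all assumptions made for the example, and "merging" is reduced to collecting the distinct schemas. It is not the interface proposed in the comment.

```scala
object LazyListingSketch {
  // Stand-in for a listed file; `schema` is None when no schema can be read from it.
  final case class FileEntry(path: String, schema: Option[String])

  /**
   * `partitionListers` holds one deferred LIST call per partition.
   * Returns the schemas found and how many partitions were actually listed.
   */
  def inferSchema(
      mergeSchema: Boolean,
      partitionListers: Seq[() => Seq[FileEntry]]): (List[String], Int) = {
    var listed = 0
    // Iterators are lazy: a partition is listed only when its files are requested.
    val schemas = partitionListers.iterator
      .flatMap { list => listed += 1; list().iterator }
      .collect { case FileEntry(_, Some(schema)) => schema }
    val result =
      if (mergeSchema) schemas.toList.distinct // needs every file, so lists everything
      else schemas.take(1).toList              // stops at the first readable schema
    (result, listed)
  }
}
```

With three partitions that each contain one readable file, mergeSchema = false lists a single partition, while mergeSchema = true lists all three.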
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175197#comment-17175197 ]

Lantao Jin commented on SPARK-32582:

I see. The implementation of the {{inferSchema}} method depends on the underlying file format. Even for ORC, we still need all the files, since the given ORC files can have different schemas and we want to get a merged schema.
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175165#comment-17175165 ]

Jarred Li commented on SPARK-32582:

The performance problem I mentioned here is not in reading the file, but in "LIST"ing the files. For example, if a table has 1000 partitions, the files in all 1000 partitions are listed first. However, only one file is read for schema inference. The "LIST" operation is time-consuming, especially for an object store such as S3.

See the code that lists the files: [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300]
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175143#comment-17175143 ]

Lantao Jin commented on SPARK-32582:

{code}
files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst
{code}

{{collectFirst()}} breaks out as soon as its iterator finds a matching value.

{code}
def collectFirst[B](pf: PartialFunction[A, B]): Option[B] = {
  for (x <- self.toIterator) { // make sure to use an iterator or `seq`
    if (pf isDefinedAt x) return Some(pf(x))
  }
  None
}
{code}

So in most cases, it reads only one file.
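The early exit described in this comment can be checked with a small standalone sketch that counts how many files are read. The readSchema function here is a hypothetical stand-in (only paths ending in ".orc" yield a schema), not the Spark method of the same name.

```scala
object CollectFirstDemo {
  // Hypothetical reader: only paths ending in ".orc" yield a schema.
  def readSchema(path: String): Option[String] =
    if (path.endsWith(".orc")) Some(s"schema-of-$path") else None

  /** Returns the first schema found and the number of files actually read. */
  def firstSchema(files: Seq[String]): (Option[String], Int) = {
    var reads = 0
    val schema = files.iterator
      .map { f => reads += 1; readSchema(f) } // lazy: runs only as elements are pulled
      .collectFirst { case Some(s) => s }     // stops pulling at the first match
    (schema, reads)
  }
}
```

For example, CollectFirstDemo.firstSchema(Seq("_SUCCESS", "a.orc", "b.orc")) returns (Some("schema-of-a.orc"), 2): the third file is never read. The full file list still has to exist before the iterator is built, so this saves reads, not LIST calls.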