[ 
https://issues.apache.org/jira/browse/HIVE-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647654#comment-17647654
 ] 

Ayush Saxena commented on HIVE-26699:
-------------------------------------

bq. you should be using the openFile() API call and set the read policy option 
to whole-file (assuming that is the intent), and ideally pass in the file 
status... or at least the file length, which is enough for s3a to skip the HEAD, 
though not abfs. See org.apache.hadoop.util.JsonSerialization for its 
max-performance JSON load, which the s3a and manifest committers both use.

Thought of patching HadoopInputFile to use that, but I think this relies on 
HADOOP-16202, which is only available in Hadoop 3.3.5+.
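Until we can depend on that, a minimal fallback along the lines of what the 
description suggests would be to force a sequential fadvise policy on the 
configuration used for Iceberg metadata reads (again a sketch; 'conf' here 
stands for whatever job/table configuration is in scope):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Works on older Hadoop lines: override the cluster-wide "random" policy for
// metadata reads. Note that s3a reads this key when the FileSystem instance is
// initialized, so it only takes effect for instances created with this conf.
Configuration metadataConf = new Configuration(conf);
metadataConf.set("fs.s3a.experimental.input.fadvise", "normal");
{code}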

> Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
> --------------------------------------------------------------
>
>                 Key: HIVE-26699
>                 URL: https://issues.apache.org/jira/browse/HIVE-26699
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> Hive reads JSON metadata (TableMetadataParser::read()) multiple times, e.g. 
> during query compilation, AM split computation, stats computation, and during 
> commits.
>
> With large JSON files (due to multiple inserts), reads take a lot longer on 
> the S3 filesystem when "fs.s3a.experimental.input.fadvise" is set to "random" 
> (on the order of 10x). To be on the safe side, it would be good to set this to 
> "normal" mode in configs when reading Iceberg tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
