Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )
Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README@489
PS22, Line 489: `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
              : `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
              : `20200112194517` is the timestamp of this version
> I agree with Csaba, and it seems we can easily make the file sizes smaller.

Thanks for pointing this out. I definitely agree here. Those Parquet files are generated by a test in Hudi, and bloom.num_entries was left at its default of 60000. I am not familiar with the indexing part of Hudi's code, so I am not sure whether this uses any built-in bloom filter feature of Parquet. But reducing this number to 100 brings each Parquet file down to ~10KB. If this size is acceptable, I will update those files.


--
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 21:48:56 +0000
Gerrit-HasComments: Yes