[jira] [Updated] (HUDI-4318) IndexOutOfBoundException when recordKey has List values for Bucket index table
[ https://issues.apache.org/jira/browse/HUDI-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-4318:
------------------------------------
Description:
Currently, the Bucket index is supported only if the record key has columns with simple values.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71

Example record for which this breaks:
column1:value1,column2:value2,column3:[value1,value2]

was:
Currently, the Bucket index is supported only if the record key has columns with simple values.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71

> IndexOutOfBoundException when recordKey has List values for Bucket index table
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-4318
>                 URL: https://issues.apache.org/jira/browse/HUDI-4318
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.1
>            Reporter: Harsha Teja Kanna
>            Priority: Minor
>
> Currently, the Bucket index is supported only if the record key has columns
> with simple values.
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71
> Example record for which this breaks
> column1:value1,column2:value2,column3:[value1,value2]

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
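The failure mode can be illustrated with a minimal sketch of the key-parsing assumption the report points at. This is a hypothetical re-creation, not the actual code at the linked BucketIdentifier.java line: it assumes a composite record key of the form "col1:v1,col2:v2" is split on "," and each pair's value is taken as split(":")[1].

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BucketKeyParseSketch {
    // Hypothetical re-creation of the parsing assumption behind the reported
    // exception (NOT the actual Hudi BucketIdentifier code).
    static List<String> hashValues(String recordKey) {
        return Arrays.stream(recordKey.split(","))
                // For a list-valued field such as column3:[value1,value2], the
                // split on "," yields the fragment "value2]", which contains no
                // ":", so split(":")[1] throws ArrayIndexOutOfBoundsException.
                .map(pair -> pair.split(":")[1])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Simple values parse fine:
        System.out.println(hashValues("column1:value1,column2:value2")); // [value1, value2]

        // The example record from the issue breaks:
        try {
            hashValues("column1:value1,column2:value2,column3:[value1,value2]");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index out of bounds, as reported");
        }
    }
}
```

Under this reading, any record-key value containing the field separator "," (such as a stringified list) produces fragments that no longer match the "column:value" shape, which would explain the out-of-bounds access.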
[jira] [Created] (HUDI-4318) IndexOutOfBoundException when recordKey has List values for Bucket index table
Harsha Teja Kanna created HUDI-4318:
------------------------------------
             Summary: IndexOutOfBoundException when recordKey has List values for Bucket index table
                 Key: HUDI-4318
                 URL: https://issues.apache.org/jira/browse/HUDI-4318
             Project: Apache Hudi
          Issue Type: Bug
          Components: core
    Affects Versions: 0.11.1
            Reporter: Harsha Teja Kanna

Currently, the Bucket index is supported only if the record key has columns with simple values.
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487008#comment-17487008 ]

Harsha Teja Kanna commented on HUDI-3335:
-----------------------------------------
This is happening again on the same table after running sync for a while. I will try to gather the details needed.

> Loading Hudi table fails with NullPointerException
> --------------------------------------------------
>
>                 Key: HUDI-3335
>                 URL: https://issues.apache.org/jira/browse/HUDI-3335
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 0.10.1
>            Reporter: Harsha Teja Kanna
>            Priority: Critical
>              Labels: hudi-on-call, user-support-issues
>             Fix For: 0.11.0
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with
> java.lang.NullPointerException
>
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
>
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(HoodieMetadataConfig.ENABLE.key(), "true").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
>
> *Passing an individual partition works though*
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(HoodieMetadataConfig.ENABLE.key(), "true").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/date=2022/01/25")
> df.createOrReplaceTempView(table)
>
> *Also, disabling metadata works, but the query takes a very long time*
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
>
> *Loading files with stacktrace:*
> at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
> at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
> at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
> at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
> at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
> at $anonfun$res3$1(<console>:46)
> at $anonfun$res3$1$adapted(<console>:40)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>
> *Writer config*
> spark-submit \
>   --master yarn \
>   --deploy-mode cluster \
>   --driver-cores 4 \
>   --driver-memory 4g \
>   --executor-cores 4 \
>   --executor-memory 6g \
>   --num-executors 8 \
>   --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
>   s3://datalake/jars/hudi-0.10.
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484469#comment-17484469 ]

Harsha Teja Kanna commented on HUDI-3335:
-----------------------------------------
Hi, thanks. By deleting the metadata and re-running the sync I am able to load the table again (but the corrupted metadata is gone now). I will run it on another instance of the table to provide the above info.
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484420#comment-17484420 ]

Harsha Teja Kanna edited comment on HUDI-3335 at 1/30/22, 8:17 PM:
-------------------------------------------------------------------
Hi, I cannot do that immediately (I will have to check); also, this is a very large table to reproduce. But I have seen this happen on the same table after creating it for the second time. I will try to delete the metadata folder and re-run the sync to see if that helps. I will also try to see if I can reproduce this on a small table.

was (Author: h7kanna):
Hi, I cannot do that immediately(I will have to check), also this is a very large to reproduce. But I have see this happen on the same table after creating it for the second time. I will try to delete the metadata folder and re-run sync to see if that helps. Also will try to see if I can reproduce this on any small table.
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484420#comment-17484420 ]

Harsha Teja Kanna commented on HUDI-3335:
-----------------------------------------
Hi, I cannot do that immediately (I will have to check); also, this is a very large table to reproduce. But I have seen this happen on the same table after creating it for the second time. I will try to delete the metadata folder and re-run the sync to see if that helps. I will also try to see if I can reproduce this on a small table.
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483970#comment-17483970 ]

Harsha Teja Kanna commented on HUDI-3335:
-----------------------------------------
22/01/28 15:14:14 INFO HoodieFileIndex: partition path PartitionRowPath([2021/07/20],date=2021/07/20), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/29],date=2021/04/29), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/02],date=2021/02/02), total files 26
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/09/30],date=2021/09/30), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/30],date=2021/12/30), total files 16
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/10],date=2021/03/10), total files 25
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/31],date=2021/08/31), total files 19
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/01/08],date=2021/01/08), total files 14
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/19],date=2021/12/19), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/26],date=2021/04/26), total files 23
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/06],date=2021/08/06), total files 19
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/23],date=2021/05/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/09],date=2021/04/09), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/12],date=2021/10/12), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/12],date=2021/06/12), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/20],date=2021/05/20), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/04],date=2021/03/04), total files 25
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/20],date=2021/08/20), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/13],date=2021/02/13), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/02],date=2021/08/02), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/05],date=2021/05/05), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/03],date=2021/03/03), total files 24
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/05],date=2021/06/05), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/23],date=2021/10/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/09/29],date=2021/09/29), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/14],date=2021/04/14), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/18],date=2021/02/18), total files 29
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/21],date=2021/03/21), total files 17
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/07/24],date=2021/07/24), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/30],date=2021/04/30), total files 17
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/01/21],date=2021/01/21), total files 15
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/29],date=2021/10/29), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/01],date=2021/10/01), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/23],date=2021/06/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/27],date=2021/12/27), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/08],date=2021/04/08), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/14],date=2021/03/14), total files 14
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/26],date=2021/03/26), total files 24
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/04],date=2021/12/04), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([202
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ]

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 8:34 AM:
-------------------------------------------------------------------
Hi,
1) Yes, partitions are of type 'yyyy/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path, following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions' shown at the end
3) No delete_partition operations were performed
4) hive_sync disabled intentionally
5) 'metadata validate-files' has been running for all partitions for a while now (389 in total), but I see the errors below for many partitions

1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-1_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands
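The "FS file not found in metadata" errors above amount to a set difference between the filesystem listing and the metadata-table listing for a partition. A minimal stdlib sketch of that comparison (a hypothetical helper for illustration, not the actual Hudi MetadataCommand code; the file names are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MetadataValidateSketch {
    // Returns the files present on the filesystem but absent from the
    // metadata-table listing — the cases logged as
    // "FS file not found in metadata <file>.parquet" above.
    static List<String> missingFromMetadata(Set<String> fsFiles, Set<String> metadataFiles) {
        List<String> missing = new ArrayList<>();
        for (String file : fsFiles) {
            if (!metadataFiles.contains(file)) {
                missing.add(file);
            }
        }
        Collections.sort(missing); // stable output for reporting
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative names only; the real listing for date=2022/01/15 had
        // 19 FS files and 0 metadata base files.
        Set<String> fs = new HashSet<>(Arrays.asList("a.parquet", "b.parquet"));
        Set<String> meta = new HashSet<>(Collections.singletonList("a.parquet"));
        System.out.println(missingFromMetadata(fs, meta)); // [b.parquet]
    }
}
```

In the log above the metadata listing for the partition is empty (#files=0) while the filesystem has 19 files, so every FS file lands in the "missing" set.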
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483610#comment-17483610 ] Harsha Teja Kanna commented on HUDI-3335: - Log
22/01/28 01:29:34 INFO Executor: Adding file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/hudi-utilities-bundle_2.12-0.10.1.jar to class loader
22/01/28 01:29:34 INFO Executor: Fetching spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar with timestamp 1643354959702
22/01/28 01:29:34 INFO Utils: Fetching spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar to /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/fetchFileTemp5819832321479921719.tmp
22/01/28 01:29:34 INFO Executor: Adding file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/org.spark-project.spark_unused-1.0.0.jar to class loader
22/01/28 01:29:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49956.
22/01/28 01:29:34 INFO NettyBlockTransferService: Server created on 192.168.86.5:49956
22/01/28 01:29:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/01/28 01:29:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.86.5:49956 with 2004.6 MiB RAM, BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:35 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/harshakanna/spark-warehouse/').
22/01/28 01:29:35 INFO SharedState: Warehouse path is 'file:/Users/harshakanna/spark-warehouse/'.
22/01/28 01:29:36 INFO DataSourceUtils: Getting table path..
22/01/28 01:29:36 INFO TablePathUtils: Getting table path from path : s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Obtained hudi table path: s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:36 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
22/01/28 01:29:36 INFO DefaultSource: Loading Base File Only View with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.metadata.enable -> true, path -> s3a://datalake-hudi/sessions/)
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/metadata/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableMetadataUtil: Loading latest merged file slices for metadata table partition files
22/01/28 01:29:38 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220126024720121__deltacommit__COMPLETED]}
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Took 2 ms to read 0 instants, 0 replaced file groups
22/01/28 01:29:38 INFO ClusteringUtils: Found 0 files in pending clustering operations
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Building file system view for partition (files)
22/01/28 01:29:38 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=9, NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=0
22/01/28 01:29:38 INFO CacheConfig: Allocating LruBlockCache size=1.42 GB, blockSize=64 KB
22/01/28 01:29:38 INFO CacheConfig: Created c
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ] Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:30 AM: --- Hi,
1) Yes, partitions are of type '/mm/dd'. No, using wildcards takes a long time. In fact we switched to using only the base path, per the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions' I am not able to run successfully; I will provide the info once I can. It fails while parsing the SPARK_MASTER url=yarn or something
3) No delete_partition operations were performed
4) hive_sync disabled intentionally
5) 'metadata validate-files' has been running for all partitions (389 in total) for a while now, but I see the errors below for many partitions:
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f
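What the `metadata validate-files` errors above boil down to is a per-partition comparison of the file listing obtained from the filesystem against the listing served by the metadata table. As a rough, hypothetical illustration (names and structure are illustrative, not Hudi's actual MetadataCommand code), the "FS file not found in metadata" check reduces to a set difference:

```python
# Hypothetical sketch of a FS-vs-metadata listing comparison; not Hudi's code.
def validate_partition(fs_files, metadata_files):
    """Return the base files present on the filesystem but absent from the
    metadata table listing, mirroring the 'FS file not found in metadata' errors."""
    missing = sorted(set(fs_files) - set(metadata_files))
    for f in missing:
        print(f"FS file not found in metadata {f}")
    return missing

# Mirrors the log above: the FS lists files, the metadata listing is empty (#files=0).
fs_listing = ["a.parquet", "b.parquet"]
metadata_listing = []
missing = validate_partition(fs_listing, metadata_listing)
```

Under this reading, a partition that reports `#files=0` from metadata while the filesystem holds 19 base files would emit one such error per file, which matches the output shown.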
[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3335: Description: Have a COW table with metadata enabled. Loading it from a Spark query fails with java.lang.NullPointerException
*Environment*
Spark 3.1.2
Hudi 0.10.1
*Query*
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig
val basePath = "s3a://datalake-hudi/v1"
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/")
df.createOrReplaceTempView(table)
*Passing an individual partition works though*
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/date=2022/01/25")
df.createOrReplaceTempView(table)
*Also, disabling metadata works, but the query takes a very long time*
val df = spark.
  read.
  format("org.apache.hudi").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/")
df.createOrReplaceTempView(table)
*Loading files with stacktrace:*
at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at $anonfun$res3$1(<console>:46)
at $anonfun$res3$1$adapted(<console>:40)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
*Writer config*
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 4 \
  --driver-memory 4g \
  --executor-cores 4 \
  --executor-memory 6g \
  --num-executors 8 \
  --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
  s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://datalake-hudi/sessions \
  --target-table sessions \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --op INSERT \
  --hoodie-conf hoodie.clean.automatic=true \
  --hoodie-conf hoodie.cleaner.commits.retained=10 \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=5 \
  --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy \
  --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
  --hoodie-con
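The top frames of the stack trace point at Guava's LocalCache (checkNotNull inside put) rejecting a null value while HoodieFileIndex is caching per-partition file listings, so a listing that comes back null for some partition would produce exactly this NullPointerException. A tiny, hypothetical sketch of that failure mode (the class below is illustrative, not Guava or Hudi code):

```python
# Hypothetical sketch: a cache that, like Guava's LocalCache, forbids nulls.
class StrictCache:
    """Rejects None keys/values on put, the way Guava's cache raises NPE."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        if key is None or value is None:
            # Guava raises NullPointerException here via Preconditions.checkNotNull
            raise ValueError("null keys/values are not permitted")
        self._store[key] = value

cache = StrictCache()
cache.put("date=2022/01/25", ["f1.parquet"])  # a normal partition listing caches fine
try:
    cache.put("date=2022/01/15", None)  # a null listing for one partition fails the whole load
except ValueError as e:
    print("put failed:", e)
```

This would also be consistent with the observed behavior: loading a single healthy partition works, while loading the base path touches every partition and hits the one whose listing is null.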
[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3335: Description: Have a COW table with metadata enabled. Loading from Spark query fails with java.lang.NullPointerException *Environment* Spark 3.1.2 Hudi 0.10.1 *Query* import org.apache.hudi.DataSourceReadOptions import org.apache.hudi.common.config.HoodieMetadataConfig val basePath = "s3a://datalake-hudi/v1" val df = spark. read. format("org.apache.hudi"). option(HoodieMetadataConfig.ENABLE.key(), "true"). option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). load(s"${basePath}/sessions/") df.createOrReplaceTempView(table) *Passing an individual partition works though* val df = spark. read. format("org.apache.hudi"). option(HoodieMetadataConfig.ENABLE.key(), "true"). option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). load(s"${basePath}/sessions/date=2022/01/25") df.createOrReplaceTempView(table) *Also, disabling metadata works, but the query taking very long time* val df = spark. read. format("org.apache.hudi"). option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). 
load(s"${basePath}/sessions/") df.createOrReplaceTempView(table) *Loading files with stacktrace:* at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) at $anonfun$res3$1(:46) at $anonfun$res3$1$adapted(:40) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) *Writer config* ** spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 4 \ --driver-memory 4g \ --executor-cores 4 \ --executor-memory 6g \ --num-executors 8 \ --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \ s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/sessions \ --target-table sessions \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=10 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=true \ --hoodie-conf hoodie.clustering.inline.max.commits=5 \ --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \ --hoodie-con
[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-3335:

Description:
Have a COW table with metadata enabled. Loading it from a Spark query fails with java.lang.NullPointerException.

*Environment*
Spark 3.1.2
Hudi 0.10.1

*Query*
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/")
df.createOrReplaceTempView(table)

*Passing an individual partition works, though*
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/date=2022/01/25")
df.createOrReplaceTempView(table)

*Also, disabling metadata works, but the query takes a very long time*
val df = spark.
  read.
  format("org.apache.hudi").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/")
df.createOrReplaceTempView(table)

*Loading files with stacktrace:*
at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at $anonfun$res3$1(<console>:46)
at $anonfun$res3$1$adapted(<console>:40)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

*Writer config*
spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --driver-cores 4 \
 --driver-memory 4g \
 --executor-cores 4 \
 --executor-memory 6g \
 --num-executors 8 \
 --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
 --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
 s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
 --table-type COPY_ON_WRITE \
 --source-ordering-field timestamp \
 --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
 --target-base-path s3a://datalake-hudi/sessions \
 --target-table sessions \
 --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
 --op INSERT \
 --hoodie-conf hoodie.clean.automatic=true \
 --hoodie-conf hoodie.cleaner.commits.retained=10 \
 --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
 --hoodie-conf hoodie.clustering.inline=true \
 --hoodie-conf hoodie.clustering.inline.max.commits=5 \
 --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy \
 --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
 --hoodie-con
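The stack trace above bottoms out in Guava's Preconditions.checkNotNull inside Spark's file-status cache. A minimal, hypothetical Java sketch of that failure mode (not Hudi or Spark code; `putLeafFiles` here is a simplified stand-in that mirrors the null guard `LocalCache.put` performs) shows why the NPE surfaces inside the cache rather than where the null partition listing was produced:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Sketch only: Spark's SharedInMemoryCache is backed by a Guava LocalCache,
// whose put() rejects null keys and values via checkNotNull. If
// HoodieFileIndex.loadPartitionPathFiles hands the cache a null file listing
// for a partition, the NPE is raised by this guard (the top frames of the
// trace), not by the code that produced the null.
public class NpeSketch {
    static final Map<String, String[]> CACHE = new HashMap<>();

    // Hypothetical simplification of the checkNotNull guard at LocalCache.put
    static void putLeafFiles(String partitionPath, String[] leafFiles) {
        Objects.requireNonNull(partitionPath, "partition path must not be null");
        Objects.requireNonNull(leafFiles, "leaf file listing must not be null");
        CACHE.put(partitionPath, leafFiles);
    }

    public static void main(String[] args) {
        putLeafFiles("date=2022/01/25", new String[]{"file1.parquet"}); // fine
        try {
            putLeafFiles("date=2022/01/26", null); // null listing -> NPE from the guard
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: " + e.getMessage());
        }
    }
}
```

This matches the observation in the report: loading one partition (a non-null listing) works, while the full-table load hits the guard.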
[jira] [Created] (HUDI-3335) Loading Hudi table fails with NullPointerException

Harsha Teja Kanna created HUDI-3335:
---
 Summary: Loading Hudi table fails with NullPointerException
 Key: HUDI-3335
 URL: https://issues.apache.org/jira/browse/HUDI-3335
 Project: Apache Hudi
 Issue Type: Bug
 Affects Versions: 0.10.1
 Reporter: Harsha Teja Kanna

Environment
Spark 3.1.2
Hudi 0.10.1

Have a COW table with metadata enabled. Loading it from a Spark query fails with java.lang.NullPointerException:

at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at $anonfun$res3$1(<console>:46)
at $anonfun$res3$1$adapted(<console>:40)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (HUDI-3216) Support timestamp with microseconds precision

[ https://issues.apache.org/jira/browse/HUDI-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480496#comment-17480496 ]

Harsha Teja Kanna commented on HUDI-3216:
-
Hi team, can you confirm whether this will have an impact if a timestamp in microseconds is used as the source ordering field?

> Support timestamp with microseconds precision
> -
>
> Key: HUDI-3216
> URL: https://issues.apache.org/jira/browse/HUDI-3216
> Project: Apache Hudi
> Issue Type: Task
> Components: spark, writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: user-support-issues
> Fix For: 0.11.0
>
> As of now, if a field with timestamp datatype w/ microsec precision is
> ingested to hudi, the resultant dataset will only have millisec
> granularity.
> Ref issue: [https://github.com/apache/hudi/issues/3429]
>
> We might need to support microsec granularity.
> The referenced issue has some pointers on how to go about it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
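The granularity loss quoted above can be sketched in a few lines. This is illustrative Java, not Hudi code, and `throughMillis` is a made-up helper; it simply shows what happens to a microsecond-precision epoch value when it is persisted with only millisecond granularity, which is exactly the concern for a source ordering (precombine) field:

```java
// Sketch of the precision issue: round-tripping an epoch-micros value
// through epoch-millis storage zeroes the sub-millisecond digits, so two
// events that differ only by microseconds become indistinguishable and
// ordering between them can no longer be decided.
public class MicrosPrecision {
    // Hypothetical helper: store an epoch-micros value as millis, then widen back
    static long throughMillis(long epochMicros) {
        long epochMillis = epochMicros / 1000L; // truncates the last 3 digits
        return epochMillis * 1000L;             // micros are now zeroed
    }

    public static void main(String[] args) {
        long eventA = 1_642_000_000_123_456L; // micros
        long eventB = 1_642_000_000_123_999L; // 543 microseconds later
        // Both collapse to 1_642_000_000_123_000 after the round trip.
        System.out.println(throughMillis(eventA) == throughMillis(eventB)); // prints true
    }
}
```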
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored - Partial parquet file discovery after the first commit

[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479835#comment-17479835 ]

Harsha Teja Kanna commented on HUDI-3242:
-
Input: monthly partitions

partition=2021/01
  file1_1 - timestamp1
  file2_1 - timestamp2
  file3_1 - timestamp3
partition=2021/02
  file1_2 - timestamp1
  file2_2 - timestamp2
  file3_2 - timestamp3

Now I want to run DeltaStreamer partition after partition to create the Hudi table.

> Checkpoint 0 is ignored - Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, writer-core
> Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
> Reporter: Harsha Teja Kanna
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, I see that for a certain table only partial discovery of files happens
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-con
[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696 ] Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 9:42 PM: --- I have input dataset partitioned but have very small files(millions of files), and JSON format. and need to create multiple Hudi Tables out of it. So What I do is process it by a spark program. converts it into Parquet but big files. As the dataset is large. the partitions are processed in parallel. and the file stamps for the files in result dataset can be in any order. Operation is only 'INSERT'. I used to create the table by setting checkpoint 0 before even in 0.10.0 release. How can I do that now? was (Author: h7kanna): I have input dataset partitioned but have very small files(millions of files), and JSON format. and need to create multiple Hudi Tables out of it. So What I do is process it by a spark program. converts it into Parquet but big files. As the dataset is large. the partitions are processed in parallel. and the file stamps for the files can be in any order. Operation is only 'INSERT'. I used to create the table by setting checkpoint 0 before even in 0.10.0 release. How can I do that now? > Checkpoint 0 is ignored -Partial parquet file discovery after the first commit > -- > > Key: HUDI-3242 > URL: https://issues.apache.org/jira/browse/HUDI-3242 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Affects Versions: 0.10.1 > Environment: AWS > EMR 6.4.0 > Spark 3.1.2 > Hudi - 0.10.1-rc >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: hudi-on-call, sev:critical, user-support-issues > Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot > 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png > > Original Estimate: 3h > Remaining Estimate: 3h > > Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. 
> However, I see for a certain table. Only partial discovery of files happening > after the initial commit of the table. > But if the second partition is given as input for the first commit, all the > files are getting discovered. > First partition : 2021/01 has 744 files and all of them are discovered > Second partition: 2021/02 has 762 files but only 72 are discovered. > Checkpoint is set to 0. > No errors in the logs. > {code:java} > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 30 \ > --driver-memory 32g \ > --executor-cores 5 \ > --executor-memory 32g \ > --num-executors 120 \ > --jars > s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer > s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar > \ > --table-type COPY_ON_WRITE \ > --source-ordering-field timestamp \ > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ > --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ > --target-table sessions_by_date \ > --transformer-class > org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ > --op INSERT \ > --checkpoint 0 \ > --hoodie-conf hoodie.clean.automatic=true \ > --hoodie-conf hoodie.cleaner.commits.retained=1 \ > --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ > --hoodie-conf hoodie.clustering.inline=false \ > --hoodie-conf hoodie.clustering.inline.max.commits=1 \ > --hoodie-conf > hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy > \ > --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ > --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \ > --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ > --hoodie-conf 
hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 > \ > --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ > --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ > --hoodie-conf hoodie.datasource.hive_sync.enable=false \ > --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ > --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ > --hoodie-conf > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > \ > --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ > --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ > --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
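The "checkpoint 0 is ignored" symptom above is consistent with DeltaStreamer treating `--checkpoint` as a reset key: the supplied value is only applied when it differs from the `deltastreamer.checkpoint.reset_key` recorded in the latest commit; if it matches, ingestion resumes from the stored `deltastreamer.checkpoint.key` instead. A minimal illustrative sketch of that resolution logic (this is a simplification in Python, not the actual Hudi code):

```python
def resolve_checkpoint(cli_checkpoint, last_commit_extra_metadata):
    """Sketch of DeltaStreamer checkpoint resolution (simplified).

    cli_checkpoint: value passed via --checkpoint, or None if not given.
    last_commit_extra_metadata: the extraMetadata dict from the latest
    .commit file on the table's timeline.
    """
    stored_key = last_commit_extra_metadata.get("deltastreamer.checkpoint.key")
    stored_reset_key = last_commit_extra_metadata.get(
        "deltastreamer.checkpoint.reset_key")
    if cli_checkpoint is not None and cli_checkpoint != stored_reset_key:
        # New reset key: honor the supplied checkpoint.
        return cli_checkpoint
    # Same reset key as the previous run (e.g. "0" again): the flag is
    # effectively ignored and the stored checkpoint wins.
    return stored_key
```

Under this model, once a commit has recorded `reset_key = "0"`, passing `--checkpoint 0` again resolves to the stored epoch-millis checkpoint (e.g. 1642668697000), so older source files are skipped, matching the partial discovery reported here.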
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696 ] Harsha Teja Kanna commented on HUDI-3242: - My dataset is partitioned but consists of very small files in JSON format, and I need to create multiple Hudi tables out of it. So I process it with a Spark program that converts it into Parquet as larger files. Because the dataset is large, the partitions are processed in parallel, and the timestamps for the files can be in any order. The operation is only 'INSERT'. I used to create the table by setting checkpoint 0, even in the 0.10.0 release. How can I do that now?
[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696 ] Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 9:41 PM: --- My input dataset is partitioned but consists of very small files (millions of them) in JSON format, and I need to create multiple Hudi tables out of it. So I process it with a Spark program that converts it into Parquet as larger files. Because the dataset is large, the partitions are processed in parallel, and the timestamps for the files can be in any order. The operation is only 'INSERT'. I used to create the table by setting checkpoint 0, even in the 0.10.0 release. How can I do that now? was (Author: h7kanna): My dataset is partitioned but consists of very small files in JSON format, and I need to create multiple Hudi tables out of it. So I process it with a Spark program that converts it into Parquet as larger files. Because the dataset is large, the partitions are processed in parallel, and the timestamps for the files can be in any order. The operation is only 'INSERT'. I used to create the table by setting checkpoint 0, even in the 0.10.0 release. How can I do that now?
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479685#comment-17479685 ] Harsha Teja Kanna commented on HUDI-3242: - Let me try that. Yes, I am aware that the checkpoint need not be set for the initial creation; it was my automation passing it by mistake that uncovered the issue. Also, a few of my unloaded datasets have non-linear timestamps across partitions, so I create the Hudi table partition by partition and set the checkpoint to 0.
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479650#comment-17479650 ] Harsha Teja Kanna commented on HUDI-3242: -
./20220114204813242.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220115020705625.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220115024513109.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220116191900632.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220116192332086.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120081554816.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120081925787.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120082326203.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120082717393.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120083124273.commit: "deltastreamer.checkpoint.reset_key" : "0",
./20220120085228007.commit: "deltastreamer.checkpoint.reset_key" : "0",
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479644#comment-17479644 ] Harsha Teja Kanna commented on HUDI-3242: -
./20220114204813242.commit: "deltastreamer.checkpoint.key" : "1642193264000"
./20220115020705625.commit: "deltastreamer.checkpoint.key" : "1642204844000"
./20220115024513109.commit: "deltastreamer.checkpoint.key" : "1642214688000"
./20220116191900632.commit: "deltastreamer.checkpoint.key" : "1642291261000"
./20220116192332086.commit: "deltastreamer.checkpoint.key" : "1642360982000"
./20220120081554816.commit: "deltastreamer.checkpoint.key" : "1642377603000"
./20220120081925787.commit: "deltastreamer.checkpoint.key" : "1642464065000"
./20220120082326203.commit: "deltastreamer.checkpoint.key" : "1642550413000"
./20220120082717393.commit: "deltastreamer.checkpoint.key" : "1642636804000"
./20220120083124273.commit: "deltastreamer.checkpoint.key" : "1642667425000"
./20220120085228007.commit: "deltastreamer.checkpoint.key" : "1642668697000"
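The checkpoint metadata quoted in the comments above can be pulled out of the timeline programmatically instead of grepping, since `.commit` files are plain JSON. A small sketch (the `.hoodie` directory path and table layout are assumptions based on a standard Hudi table):

```python
import json
from pathlib import Path

def checkpoint_history(hoodie_dir):
    """Return (commit_file, checkpoint.key, checkpoint.reset_key) tuples
    for every completed commit file under the table's .hoodie directory,
    in timeline (filename) order."""
    rows = []
    for commit_file in sorted(Path(hoodie_dir).glob("*.commit")):
        # Commit files are JSON documents; DeltaStreamer records its
        # checkpoint under the extraMetadata map.
        meta = json.loads(commit_file.read_text()).get("extraMetadata") or {}
        rows.append((commit_file.name,
                     meta.get("deltastreamer.checkpoint.key"),
                     meta.get("deltastreamer.checkpoint.reset_key")))
    return rows
```

Running this against the table's `.hoodie` directory reproduces the listing above: every commit carries `reset_key = "0"` alongside an advancing epoch-millis `checkpoint.key`.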
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479633#comment-17479633 ] Harsha Teja Kanna commented on HUDI-3242: - 5) Checkpoint from the last commit before this replace commit:
{code:json}
{
  "partitionToWriteStats" : {
    "date=2022/01/19" : [ {
      "fileId" : "4c5a04e9-9288-4000-9909-4c2640c5b779-0",
      "path" : "date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet",
      "prevCommit" : "20220120083230209",
      "numWrites" : 297106,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 3452,
      "totalWriteBytes" : 12468290,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "date=2022/01/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 12468290,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]}",
    "deltastreamer.checkpoint.reset_key" : "0",
    "deltastreamer.checkpoint.key" : "1642668697000"
  },
  "operationType" : "UPSERT",
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 6371,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "date=2022/01/19" ],
  "fileIdAndRelativePaths" : {
    "4c5a04e9-9288-4000-9909-4c2640c5b779-0" : "date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet"
  }
}
{code}
!Screen Shot 2022-01-20 at 1.36.48 PM.png! 
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: Screen Shot 2022-01-20 at 1.36.48 PM.png > Checkpoint 0 is ignored -Partial parquet file discovery after the first commit > -- > > Key: HUDI-3242 > URL: https://issues.apache.org/jira/browse/HUDI-3242 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Affects Versions: 0.10.1 > Environment: AWS > EMR 6.4.0 > Spark 3.1.2 > Hudi - 0.10.1-rc >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: hudi-on-call, sev:critical, user-support-issues > Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot > 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png > > Original Estimate: 3h > Remaining Estimate: 3h > > Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. > However, I see for a certain table. Only partial discovery of files happening > after the initial commit of the table. > But if the second partition is given as input for the first commit, all the > files are getting discovered. > First partition : 2021/01 has 744 files and all of them are discovered > Second partition: 2021/02 has 762 files but only 72 are discovered. > Checkpoint is set to 0. > No errors in the logs. 
> {code:java}
> spark-submit \
>   --master yarn \
>   --deploy-mode cluster \
>   --driver-cores 30 \
>   --driver-memory 32g \
>   --executor-cores 5 \
>   --executor-memory 32g \
>   --num-executors 120 \
>   --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
>   --table-type COPY_ON_WRITE \
>   --source-ordering-field timestamp \
>   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
>   --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
>   --target-table sessions_by_date \
>   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
>   --op INSERT \
>   --checkpoint 0 \
>   --hoodie-conf hoodie.clean.automatic=true \
>   --hoodie-conf hoodie.cleaner.commits.retained=1 \
>   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
>   --hoodie-conf hoodie.clustering.inline=false \
>   --hoodie-conf hoodie.clustering.inline.max.commits=1 \
>   --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
>   --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
>   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
>   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
>   --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
>   --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
>   --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
>   --hoodie-conf hoodie.datasource.hive_sync.enable=false \
>   --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
>   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
>   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
>   --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
>   --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
>   --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
>   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
>   --hoodie-conf hoodie.datasource.write.operation=insert \
>   --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
>   --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
>   --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
>   --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
>   --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
>   --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
>   --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
>   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
>   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \
>   --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \
>   --hoodie-conf "\"hoodie.deltastreamer.tra
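The reported symptom (762 files in partition 2021/02 but only 72 discovered) is consistent with the source path selector filtering files by modification time against a resumed checkpoint rather than the requested checkpoint of 0. Below is a minimal sketch of that selection logic; it is a simplified model, not Hudi's actual `DFSPathSelector` code, and the file names and timestamps are invented for illustration.

```python
# Simplified model of mod-time-based source file selection: only files
# modified strictly after the last checkpoint (epoch millis) are returned.
# If an old checkpoint is silently resumed instead of the requested "0",
# most files in a backfill partition are skipped.
def select_new_files(files, last_checkpoint_ms):
    """files: list of (path, mod_time_ms); returns paths newer than the checkpoint."""
    return [path for path, mtime in files if mtime > last_checkpoint_ms]

# Hypothetical partition with 762 files, written one second apart.
base = 1_600_000_000_000
files = [(f"2021/02/f{i:03d}.parquet", base + i * 1_000) for i in range(762)]

# With the requested checkpoint of 0, every file is eligible.
print(len(select_new_files(files, 0)))            # 762

# If an old checkpoint is resumed instead, only the newest files survive
# (mirroring the "only 72 of 762 discovered" symptom from the report).
print(len(select_new_files(files, base + 689_000)))  # 72
```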
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479631#comment-17479631 ] Harsha Teja Kanna commented on HUDI-3242: - Hi, initially I saw the behavior of files not being picked up from the partition, but that dataset was produced differently: the source dataset was unloaded from the warehouse, and files in different partitions do not have linear timestamps, so I thought that might be the cause. Now I am seeing this in all the tables: when passing checkpoint 0 to reprocess a partition, it just skips that partition and only processes the current one. So I found that the checkpoint is ignored. I see it in the Deltastreamer logs, and the job ends in 20 seconds. 2) No clustering commits are pending; only one Deltastreamer is running and it completed successfully. I can see the clustering commit on the timeline. 3) I can reproduce it consistently; I am not able to backfill the tables currently. 4) Contents of the replace commit: { "partitionToWriteStats" : { "date=2022/01/19" : [ { "fileId" : "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0", "path" : "date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet", "prevCommit" : "null", "numWrites" : 297106, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 297106, "totalWriteBytes" : 12468914, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "date=2022/01/19", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 12468914, "minEventTime" : null, "maxEventTime" : null } ] }, "compacted" : false, "extraMetadata" : { "schema" : "{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]" }, "operationType" : "CLUSTER", "partitionToReplaceFileIds" : { "date=2022/01/19" : [ "4c5a04e9-9288-4000-9909-4c2640c5b779-0" ] 
}, "totalRecordsDeleted" : 0, "totalLogRecordsCompacted" : 0, "totalLogFilesCompacted" : 0, "totalCompactedRecordsUpdated" : 0, "totalLogFilesSize" : 0, "totalScanTime" : 0, "totalCreateTime" : 9815, "totalUpsertTime" : 0, "minAndMaxEventTime" : { "Optional.empty" : { "val" : null, "present" : false } }, "writePartitionPaths" : [ "date=2022/01/19" ], "fileIdAndRelativePaths" : { "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0" : "date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet" } } > Checkpoint 0 is ignored -Partial parquet file discovery after the first commit > -- > > Key: HUDI-3242 > URL: https://issues.apache.org/jira/browse/HUDI-3242 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Affects Versions: 0.10.1 > Environment: AWS > EMR 6.4.0 > Spark 3.1.2 > Hudi - 0.10.1-rc >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: hudi-on-call, sev:critical, user-support-issues > Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot > 2022-01-13 at 2.55.35 AM.png > > Original Estimate: 3h > Remaining Estimate: 3h > > Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. > However, I see for a certain table. Only partial discovery of files happening > after the initial commit of the table. > But if the second partition is given as input for the first commit, all the > files are getting discovered. > First partition : 2021/01 has 744 files and all of them are discovered > Second partition: 2021/02 has 762 files but only 72 are discovered. > Checkpoint is set to 0. > No errors in the logs. 
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479240#comment-17479240 ] Harsha Teja Kanna commented on HUDI-3242: - So what I found from further debugging is that once --checkpoint 0 has been passed to Deltastreamer, it will not pick it up again if the value is the same. [https://github.com/apache/hudi/blob/release-0.10.1/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L471] I added log statements in a PR against the master branch. This is what I got: 22/01/20 04:04:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals 22/01/20 04:04:28 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties 22/01/20 04:04:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals 22/01/20 04:04:29 INFO HoodieActiveTimeline: Loaded instants upto : Option\{val=[20220120085344674__replacecommit__COMPLETED]} 22/01/20 04:04:29 INFO DFSPathSelector: Using path selector org.apache.hudi.utilities.sources.helpers.DFSPathSelector 22/01/20 04:04:29 INFO HoodieDeltaStreamer: Delta Streamer running only single round 22/01/20 04:04:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals 22/01/20 04:04:29 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties 22/01/20 04:04:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals 22/01/20 04:04:30 INFO HoodieActiveTimeline: Loaded instants upto : Option\{val=[20220120085344674__replacecommit__COMPLETED]} 22/01/20 04:04:30 INFO DeltaSync: *Checkpoint reset from metadata: 0* 22/01/20 04:04:30 INFO DeltaSync: *Checkpoint from config: 0* 22/01/20 04:04:30 INFO DeltaSync: 
*Checkpoint to resume from : Option\{val=1642668697000}* 22/01/20 04:04:30 INFO DFSPathSelector: Root path => s3a://datalake-hudi/v1/journals/year=2022/month=01/day=19 source limit => 9223372036854775807 22/01/20 04:04:37 INFO DeltaSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option\{val=1642668697000}). New Checkpoint=(1642668697000) 22/01/20 04:04:37 INFO DeltaSync: Shutting down embedded timeline server 22/01/20 04:04:37 INFO HoodieDeltaStreamer: Shut down delta streamer 22/01/20 04:04:37 INFO SparkUI: Stopped Spark web UI at http://192.168.86.5:4040 22/01/20 04:04:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/01/20 04:04:37 INFO MemoryStore: MemoryStore cleared 22/01/20 04:04:37 INFO BlockManager: BlockManager stopped 22/01/20 04:04:38 INFO BlockManagerMaster: BlockManagerMaster stopped 22/01/20 04:04:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 22/01/20 04:04:38 INFO SparkContext: Successfully stopped SparkContext 22/01/20 04:04:38 INFO ShutdownHookManager: Shutdown hook called 22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-acf0e21c-c48c-440c-86f8-72ff20bef349 22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-b53eb674-0c67-4b68-8974-7ff706408686 22/01/20 04:04:38 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system... 22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system stopped. 22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete. 
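The log lines above ("Checkpoint reset from metadata: 0", "Checkpoint from config: 0", "Checkpoint to resume from : Option{val=1642668697000}") suggest the resume decision around DeltaSync.java#L471 works roughly as follows: a `--checkpoint` value is treated as already applied when it matches the reset value recorded in the latest commit's metadata, so the checkpoint stored by the last commit wins. The Python sketch below is a hedged model of that decision; the key names mirror Hudi's commit-metadata keys, but the exact semantics shown here are an assumption, not the actual implementation.

```python
# Hedged model (assumption, not Hudi's real code) of the checkpoint-resume
# decision in DeltaSync: the configured --checkpoint is honored only when it
# differs from the reset value recorded in the latest commit metadata.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"
CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key"

def resolve_checkpoint(config_checkpoint, latest_commit_metadata):
    reset_from_metadata = latest_commit_metadata.get(CHECKPOINT_RESET_KEY)
    if config_checkpoint is not None and config_checkpoint != reset_from_metadata:
        # Fresh override: honor the value from --checkpoint.
        return config_checkpoint
    # Same value as last time (or none given): resume the stored checkpoint.
    return latest_commit_metadata.get(CHECKPOINT_KEY)

meta = {CHECKPOINT_KEY: "1642668697000", CHECKPOINT_RESET_KEY: "0"}
print(resolve_checkpoint("0", meta))  # 1642668697000 -- "--checkpoint 0" is ignored
print(resolve_checkpoint("5", meta))  # 5 -- a different value would be honored
```

Under this model, re-running with `--checkpoint 0` after it was already recorded reproduces exactly the logged behavior: the job resumes from 1642668697000 and finds "No new data".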
[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479240#comment-17479240 ] Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 10:23 AM: [~shivnarayan] So what I found from further debugging is that once the --checkpoint 0 is passed once to Deltastreamer, it will not pick it again if it is same. [https://github.com/apache/hudi/blob/release-0.10.1/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L471] I added log statements in a PR to master branch This is what I got 22/01/20 04:04:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals 22/01/20 04:04:28 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties 22/01/20 04:04:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals 22/01/20 04:04:29 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220120085344674__replacecommit__COMPLETED]} 22/01/20 04:04:29 INFO DFSPathSelector: Using path selector org.apache.hudi.utilities.sources.helpers.DFSPathSelector 22/01/20 04:04:29 INFO HoodieDeltaStreamer: Delta Streamer running only single round 22/01/20 04:04:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals 22/01/20 04:04:29 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties 22/01/20 04:04:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals 22/01/20 04:04:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220120085344674__replacecommit__COMPLETED]} 22/01/20 04:04:30 INFO DeltaSync: *Checkpoint reset from metadata: 0* 22/01/20 04:04:30 INFO DeltaSync: *Checkpoint from config: 
0* 22/01/20 04:04:30 INFO DeltaSync: *Checkpoint to resume from : Option\{val=1642668697000}* 22/01/20 04:04:30 INFO DFSPathSelector: Root path => s3a://datalake-hudi/v1/journals/year=2022/month=01/day=19 source limit => 9223372036854775807 22/01/20 04:04:37 INFO DeltaSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option\{val=1642668697000}). New Checkpoint=(1642668697000) 22/01/20 04:04:37 INFO DeltaSync: Shutting down embedded timeline server 22/01/20 04:04:37 INFO HoodieDeltaStreamer: Shut down delta streamer 22/01/20 04:04:37 INFO SparkUI: Stopped Spark web UI at [http://192.168.86.5:4040|http://192.168.86.5:4040/] 22/01/20 04:04:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 22/01/20 04:04:37 INFO MemoryStore: MemoryStore cleared 22/01/20 04:04:37 INFO BlockManager: BlockManager stopped 22/01/20 04:04:38 INFO BlockManagerMaster: BlockManagerMaster stopped 22/01/20 04:04:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 22/01/20 04:04:38 INFO SparkContext: Successfully stopped SparkContext 22/01/20 04:04:38 INFO ShutdownHookManager: Shutdown hook called 22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-acf0e21c-c48c-440c-86f8-72ff20bef349 22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-b53eb674-0c67-4b68-8974-7ff706408686 22/01/20 04:04:38 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system... 22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system stopped. 22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete. 
[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476923#comment-17476923 ] Harsha Teja Kanna edited comment on HUDI-3242 at 1/17/22, 2:30 AM: --- I am now seeing this for all the tables after running for a few commits; at one partition I had to backfill, and the checkpoint 0 is ignored. The problem is that I am not able to reproduce it deterministically on a test table. 
[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476923#comment-17476923 ] Harsha Teja Kanna commented on HUDI-3242: - I am now seeing this for all the tables after running for a few commits; at one partition I had to backfill, and the checkpoint 0 is ignored. 
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Priority: Blocker (was: Critical)
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Priority: Critical (was: Blocker)
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475935#comment-17475935 ] Harsha Teja Kanna edited comment on HUDI-3066 at 1/14/22, 5:04 AM:
---

With only the base path in the load, file listing time (a few seconds) is negligible compared to query runtime. Thanks. Though I have to add a new column 'date' to every table to resolve the column type clash mentioned above.

was (Author: h7kanna):
With only the base path in the load, file listing time (a few seconds) is negligible compared to query runtime. Thanks. Though I have to add a new column 'date' to every table to address the above column type clash.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
> --
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0, Hudi version: 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 PM.png, metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, metadata_timeline_archived.txt, metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. > Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475935#comment-17475935 ] Harsha Teja Kanna commented on HUDI-3066:
-

With only the base path in the load, file listing time (a few seconds) is negligible compared to query runtime. Thanks. Though I have to add a new column 'date' to every table to address the above column type clash.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
> --
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0, Hudi version: 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 PM.png, metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, metadata_timeline_archived.txt, metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. > Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Rea
[jira] [Comment Edited] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode
[ https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475869#comment-17475869 ] Harsha Teja Kanna edited comment on HUDI-2947 at 1/14/22, 12:17 AM:

I think this problem still exists, but not in continuous mode: https://issues.apache.org/jira/browse/HUDI-3242

was (Author: h7kanna):
I think this problem still exists: https://issues.apache.org/jira/browse/HUDI-3242

> HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode
> --
>
> Key: HUDI-2947
> URL: https://issues.apache.org/jira/browse/HUDI-2947
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: pull-request-available, sev:critical
> Fix For: 0.10.1
>
> *Problem:*
> When deltastreamer is started with a given checkpoint, e.g., `--checkpoint 0`, in continuous mode, the job may pick up the wrong checkpoint later on. The wrong checkpoint (for the 20211206203551080 commit) happens after the replacecommit and clean: it is reset to "0" instead of "5" (the value expected after 20211206202728233.commit). More details below.
>
> The bug is due to the check here:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335]
> {code:java}
> if (cfg.checkpoint != null && (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))
>     || !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)))) {
>   resumeCheckpointStr = Option.of(cfg.checkpoint);
> } {code}
> In this case of resuming after a clustering commit, "cfg.checkpoint != null" and "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" are both true, as "--checkpoint 0" is configured and the last commit is a replacecommit without checkpoint keys.
This leads to the resume checkpoint > string being reset to the configured checkpoint, skipping the timeline > walk-back logic below, which is wrong. > > Timeline: > > {code:java} > 189069 Dec 6 12:19 20211206201238649.commit > 0 Dec 6 12:12 20211206201238649.commit.requested > 0 Dec 6 12:12 20211206201238649.inflight > 189069 Dec 6 12:27 20211206201959151.commit > 0 Dec 6 12:20 20211206201959151.commit.requested > 0 Dec 6 12:20 20211206201959151.inflight > 189069 Dec 6 12:34 20211206202728233.commit > 0 Dec 6 12:27 20211206202728233.commit.requested > 0 Dec 6 12:27 20211206202728233.inflight > 36662 Dec 6 12:35 20211206203449899.replacecommit > 0 Dec 6 12:35 20211206203449899.replacecommit.inflight > 34656 Dec 6 12:35 20211206203449899.replacecommit.requested > 28013 Dec 6 12:35 20211206203503574.clean > 19024 Dec 6 12:35 20211206203503574.clean.inflight > 19024 Dec 6 12:35 20211206203503574.clean.requested > 189069 Dec 6 12:43 20211206203551080.commit > 0 Dec 6 12:35 20211206203551080.commit.requested > 0 Dec 6 12:35 20211206203551080.inflight > 189069 Dec 6 12:50 20211206204311612.commit > 0 Dec 6 12:43 20211206204311612.commit.requested > 0 Dec 6 12:43 20211206204311612.inflight > 0 Dec 6 12:50 20211206205044595.commit.requested > 0 Dec 6 12:50 20211206205044595.inflight > 128 Dec 6 12:56 archived > 483 Dec 6 11:52 hoodie.properties > {code} > > Checkpoints in commits: > > {code:java} > grep "deltastreamer.checkpoint.key" * > 20211206201238649.commit: "deltastreamer.checkpoint.key" : "2" > 20211206201959151.commit: "deltastreamer.checkpoint.key" : "3" > 20211206202728233.commit: "deltastreamer.checkpoint.key" : "4" > 20211206203551080.commit: "deltastreamer.checkpoint.key" : "1" > 20211206204311612.commit: "deltastreamer.checkpoint.key" : "2" {code} > > *Steps to reproduce:* > Run HoodieDeltaStreamer in the continuous mode, by providing both > "--checkpoint 0" and "--continuous", with inline clustering and sync clean > enabled (some configs are masked). 
> > {code:java} > spark-submit \ > --master yarn \ > --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores > 4 \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain > \ > --conf spark.speculation=true \ > --conf spark.speculation.multiplier=1.0 \ > --conf spark.speculation.quantile=0.5 \ > --packages org.apache.spark:spark-avro_2.12:3.2.0 \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > file:/h
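The failure mode described above can be modeled in isolation. Below is a minimal standalone sketch, not the actual Hudi classes: the method name, the plain `java.util` types, and the literal value of `CHECKPOINT_RESET_KEY` are simplifications for illustration. It shows that when the last commit's metadata carries no checkpoint reset key (as with a replacecommit written by inline clustering), the condition is satisfied and the CLI checkpoint wins over the timeline walk-back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CheckpointResetSketch {
    // Assumed key name for illustration only.
    static final String CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key";

    // Mirrors the shape of the quoted DeltaSync check: if a CLI checkpoint is set
    // and the last commit has no (or a different) reset key, resume from the CLI value.
    static Optional<String> resumeCheckpoint(String cliCheckpoint, Map<String, String> lastCommitMetadata) {
        String resetKey = lastCommitMetadata.get(CHECKPOINT_RESET_KEY);
        boolean resetKeyMissing = (resetKey == null || resetKey.isEmpty());
        if (cliCheckpoint != null && (resetKeyMissing || !cliCheckpoint.equals(resetKey))) {
            return Optional.of(cliCheckpoint);
        }
        // Otherwise fall through to the timeline walk-back logic.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A replacecommit written by inline clustering carries no checkpoint keys,
        // so the configured "--checkpoint 0" is wrongly picked up again.
        Map<String, String> replaceCommitMetadata = new HashMap<>();
        System.out.println(resumeCheckpoint("0", replaceCommitMetadata)); // prints Optional[0]
    }
}
```

Walking the timeline back to the last commit that actually carries a checkpoint key, instead of short-circuiting on the CLI value, would yield "5" in the scenario above.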
[jira] [Commented] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode
[ https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475869#comment-17475869 ] Harsha Teja Kanna commented on HUDI-2947:
-

I think this problem still exists: https://issues.apache.org/jira/browse/HUDI-3242

> HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode
> --
>
> Key: HUDI-2947
> URL: https://issues.apache.org/jira/browse/HUDI-2947
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: pull-request-available, sev:critical
> Fix For: 0.10.1
>
> *Problem:*
> When deltastreamer is started with a given checkpoint, e.g., `--checkpoint 0`, in continuous mode, the job may pick up the wrong checkpoint later on. The wrong checkpoint (for the 20211206203551080 commit) happens after the replacecommit and clean: it is reset to "0" instead of "5" (the value expected after 20211206202728233.commit). More details below.
>
> The bug is due to the check here:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335]
> {code:java}
> if (cfg.checkpoint != null && (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))
>     || !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)))) {
>   resumeCheckpointStr = Option.of(cfg.checkpoint);
> } {code}
> In this case of resuming after a clustering commit, "cfg.checkpoint != null" and "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" are both true, as "--checkpoint 0" is configured and the last commit is a replacecommit without checkpoint keys. This leads to the resume checkpoint string being reset to the configured checkpoint, skipping the timeline walk-back logic below, which is wrong.
> > Timeline: > > {code:java} > 189069 Dec 6 12:19 20211206201238649.commit > 0 Dec 6 12:12 20211206201238649.commit.requested > 0 Dec 6 12:12 20211206201238649.inflight > 189069 Dec 6 12:27 20211206201959151.commit > 0 Dec 6 12:20 20211206201959151.commit.requested > 0 Dec 6 12:20 20211206201959151.inflight > 189069 Dec 6 12:34 20211206202728233.commit > 0 Dec 6 12:27 20211206202728233.commit.requested > 0 Dec 6 12:27 20211206202728233.inflight > 36662 Dec 6 12:35 20211206203449899.replacecommit > 0 Dec 6 12:35 20211206203449899.replacecommit.inflight > 34656 Dec 6 12:35 20211206203449899.replacecommit.requested > 28013 Dec 6 12:35 20211206203503574.clean > 19024 Dec 6 12:35 20211206203503574.clean.inflight > 19024 Dec 6 12:35 20211206203503574.clean.requested > 189069 Dec 6 12:43 20211206203551080.commit > 0 Dec 6 12:35 20211206203551080.commit.requested > 0 Dec 6 12:35 20211206203551080.inflight > 189069 Dec 6 12:50 20211206204311612.commit > 0 Dec 6 12:43 20211206204311612.commit.requested > 0 Dec 6 12:43 20211206204311612.inflight > 0 Dec 6 12:50 20211206205044595.commit.requested > 0 Dec 6 12:50 20211206205044595.inflight > 128 Dec 6 12:56 archived > 483 Dec 6 11:52 hoodie.properties > {code} > > Checkpoints in commits: > > {code:java} > grep "deltastreamer.checkpoint.key" * > 20211206201238649.commit: "deltastreamer.checkpoint.key" : "2" > 20211206201959151.commit: "deltastreamer.checkpoint.key" : "3" > 20211206202728233.commit: "deltastreamer.checkpoint.key" : "4" > 20211206203551080.commit: "deltastreamer.checkpoint.key" : "1" > 20211206204311612.commit: "deltastreamer.checkpoint.key" : "2" {code} > > *Steps to reproduce:* > Run HoodieDeltaStreamer in the continuous mode, by providing both > "--checkpoint 0" and "--continuous", with inline clustering and sync clean > enabled (some configs are masked). 
> > {code:java} > spark-submit \ > --master yarn \ > --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores > 4 \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain > \ > --conf spark.speculation=true \ > --conf spark.speculation.multiplier=1.0 \ > --conf spark.speculation.quantile=0.5 \ > --packages org.apache.spark:spark-avro_2.12:3.2.0 \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > file:/home/hadoop/ethan/hudi-utilities-bundle_2.12-0.10.0-rc3.jar \ > --props file:/home/hadoop/ethan/test.properties \ > --source-class ... \ > --source-ordering-field ts \ > --target-bas
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242:

Description:
Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
However, for a certain table, only partial discovery of files happens after the initial commit.
But if the second partition is given as input for the first commit, all the files are discovered.
First partition: 2021/01 has 744 files and all of them are discovered.
Second partition: 2021/02 has 762 files but only 72 are discovered.
Checkpoint is set to 0.
No errors in the logs.

{code:java} spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 30 \ --driver-memory 32g \ --executor-cores 5 \ --executor-memory 32g \ --num-executors 120 \ --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ --target-table sessions_by_date \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --checkpoint 0 \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=1 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=false \ --hoodie-conf hoodie.clustering.inline.max.commits=1 \ --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ --hoodie-conf
hoodie.clustering.plan.strategy.small.file.limit=25000 \ --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \ --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ --hoodie-conf hoodie.datasource.hive_sync.enable=false \ --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \ --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \ --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \ --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ 
--hoodie-conf hoodie.file.listing.parallelism=256 \ --hoodie-conf hoodie.finalize.write.parallelism=256 \ --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \ --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \ --hoodie-conf hoodie.metadata.enable=true \ --hoodie-conf hoodie.metadata.metrics.enable=true \ --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \ --hoodie-conf hoodie.metrics.on=true \ --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \ --hoodie-conf hoodie.parquet.block.size=268435456 \ --hoodie-conf hoodie.parquet.compression.codec=snappy \ --hoodie-conf hoodie.parquet.max.file.size=268435456 \ --hoodie-conf hoodie.parquet.small.file.limit=25000 {code} was: Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. However, I see for a certain table. Only partial dis
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242:

Priority: Blocker (was: Critical)

> Checkpoint 0 is ignored - Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.1
> Environment: AWS EMR 6.4.0, Spark 3.1.2, Hudi 0.10.1-rc
> Reporter: Harsha Teja Kanna
> Priority: Blocker
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 2022-01-13 at 2.55.35 AM.png
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after the initial commit.
> But if the second partition is given as input for the first commit, all the files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> No errors in the logs.
> {code:java} > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 30 \ > --driver-memory 32g \ > --executor-cores 5 \ > --executor-memory 32g \ > --num-executors 120 \ > --jars > s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer > s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar > \ > --table-type COPY_ON_WRITE \ > --source-ordering-field timestamp \ > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ > --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ > --target-table sessions_by_date \ > --transformer-class > org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ > --op INSERT \ > --checkpoint 0 \ > --hoodie-conf hoodie.clean.automatic=true \ > --hoodie-conf hoodie.cleaner.commits.retained=1 \ > --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ > --hoodie-conf hoodie.clustering.inline=false \ > --hoodie-conf hoodie.clustering.inline.max.commits=1 \ > --hoodie-conf > hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy > \ > --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ > --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \ > --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ > --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 > \ > --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ > --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ > --hoodie-conf hoodie.datasource.hive_sync.enable=false \ > --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ > --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ > --hoodie-conf > 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > \ > --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ > --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ > --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ > --hoodie-conf > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator > \ > --hoodie-conf hoodie.datasource.write.operation=insert \ > --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ > --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ > --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ > --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ > --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ > --hoodie-conf > hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 > \ > --hoodie-conf > hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector > \ > --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, > to_timestamp(timestamp) as timestamp, sid, > date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ > --hoodie-conf hoodie.file.listing.parallelism=256 \ > --hoodie-conf hoodie.finalize.write.parallelism=256 \ > -
[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242:

Summary: Checkpoint 0 is ignored - Partial parquet file discovery after the first commit (was: Partial parquet file discovery after the first commit)

> Checkpoint 0 is ignored - Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.1
> Environment: AWS EMR 6.4.0, Spark 3.1.2, Hudi 0.10.1-rc
> Reporter: Harsha Teja Kanna
> Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 2022-01-13 at 2.55.35 AM.png
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after the initial commit.
> But if the second partition is given as input for the first commit, all the files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> No errors in the logs.
> {code:java} > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 30 \ > --driver-memory 32g \ > --executor-cores 5 \ > --executor-memory 32g \ > --num-executors 120 \ > --jars > s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer > s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar > \ > --table-type COPY_ON_WRITE \ > --source-ordering-field timestamp \ > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ > --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ > --target-table sessions_by_date \ > --transformer-class > org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ > --op INSERT \ > --checkpoint 0 \ > --hoodie-conf hoodie.clean.automatic=true \ > --hoodie-conf hoodie.cleaner.commits.retained=1 \ > --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ > --hoodie-conf hoodie.clustering.inline=false \ > --hoodie-conf hoodie.clustering.inline.max.commits=1 \ > --hoodie-conf > hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy > \ > --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ > --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \ > --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ > --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 > \ > --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ > --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ > --hoodie-conf hoodie.datasource.hive_sync.enable=false \ > --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ > --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ > --hoodie-conf > 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > \ > --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ > --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ > --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ > --hoodie-conf > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator > \ > --hoodie-conf hoodie.datasource.write.operation=insert \ > --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ > --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ > --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ > --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ > --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ > --hoodie-conf > hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 > \ > --hoodie-conf > hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector > \ > --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, > to_timestamp(timestamp) as timestamp, sid, > date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ > --hoodie-conf hoodie.
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Description: Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it. However, for a certain table, only partial discovery of files happens after the initial commit. But if the second partition is given as input for the first commit, all the files are discovered. First partition: 2021/01 has 744 files and all of them are discovered. Second partition: 2021/02 has 762 files, but only 72 are discovered. No errors in the logs. {code:java} spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 30 \ --driver-memory 32g \ --executor-cores 5 \ --executor-memory 32g \ --num-executors 120 \ --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ --target-table sessions_by_date \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --checkpoint 0 \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=1 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=false \ --hoodie-conf hoodie.clustering.inline.max.commits=1 \ --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ --hoodie-conf 
hoodie.clustering.plan.strategy.small.file.limit=25000 \ --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \ --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ --hoodie-conf hoodie.datasource.hive_sync.enable=false \ --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \ --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \ --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \ --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ 
--hoodie-conf hoodie.file.listing.parallelism=256 \ --hoodie-conf hoodie.finalize.write.parallelism=256 \ --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \ --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \ --hoodie-conf hoodie.metadata.enable=true \ --hoodie-conf hoodie.metadata.metrics.enable=true \ --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \ --hoodie-conf hoodie.metrics.on=true \ --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \ --hoodie-conf hoodie.parquet.block.size=268435456 \ --hoodie-conf hoodie.parquet.compression.codec=snappy \ --hoodie-conf hoodie.parquet.max.file.size=268435456 \ --hoodie-conf hoodie.parquet.small.file.limit=25000 {code}
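For readers unfamiliar with how DeltaStreamer picks up source files: a DFSPathSelector-style source tracks a checkpoint (roughly, the latest modification time already ingested) and only selects files modified after it, optionally bounded by a per-round size budget. The sketch below is a simplified, hypothetical model of that selection logic, not Hudi's actual implementation; the function name `select_files` and its parameters are illustrative only. It is useful for reasoning about why a run with `--checkpoint 0` could still discover only part of a partition.

```python
# Hypothetical sketch of checkpoint-based incremental file selection,
# loosely modeled on DFSPathSelector's behavior (names are illustrative,
# not Hudi's API). Files modified at or before the last checkpoint are
# skipped, and an optional size budget can cut a round short.
def select_files(files, last_checkpoint_ms, max_bytes=None):
    """files: list of (path, mod_time_ms, size_bytes).
    Returns (selected_paths, new_checkpoint_ms)."""
    # Only files strictly newer than the checkpoint are eligible.
    eligible = sorted(
        (f for f in files if f[1] > last_checkpoint_ms),
        key=lambda f: f[1],
    )
    selected, total = [], 0
    for path, mod_time, size in eligible:
        if max_bytes is not None and total + size > max_bytes and selected:
            break  # budget exhausted: remaining files wait for the next round
        selected.append(path)
        total += size
    chosen = set(selected)
    # Advance the checkpoint to the newest file actually ingested.
    new_checkpoint = max(
        (f[1] for f in files if f[0] in chosen),
        default=last_checkpoint_ms,
    )
    return selected, new_checkpoint
```

Under this model a checkpoint of 0 makes every file eligible, so partial discovery would have to come from the per-round budget or from the supplied checkpoint being ignored or overwritten, which is what the issue's retitle ("Checkpoint 0 is ignored") points at.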
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: (was: Screen Shot 2022-01-13 at 2.40.55 AM.png)
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: Screen Shot 2022-01-13 at 2.40.55 AM.png Screen Shot 2022-01-13 at 2.55.35 AM.png
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: Screen Shot 2022-01-13 at 2.55.35 AM.png
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: (was: Screen Shot 2022-01-13 at 2.40.55 AM-2.png)
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: Screen Shot 2022-01-13 at 2.40.55 AM.png
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Attachment: (was: Screen Shot 2022-01-13 at 2.55.35 AM.png)
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Description: Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it. However, for a certain table, only partial discovery of files is happening after the initial commit of the table. But if the second partition is given as input for the first commit, all the files are getting discovered. First partition: 2021/01 has 744 files and all of them are discovered Second partition: 2021/02 has 762 files but only 72 are discovered. No errors in the logs. {code:java} spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 30 \ --driver-memory 32g \ --executor-cores 5 \ --executor-memory 32g \ --num-executors 120 \ --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ --target-table sessions_by_date \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=1 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=false \ --hoodie-conf hoodie.clustering.inline.max.commits=1 \ --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ --hoodie-conf 
hoodie.clustering.plan.strategy.small.file.limit=25000 \ --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \ --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ --hoodie-conf hoodie.datasource.hive_sync.enable=false \ --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \ --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \ --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \ --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ 
--hoodie-conf hoodie.file.listing.parallelism=256 \ --hoodie-conf hoodie.finalize.write.parallelism=256 \ --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \ --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \ --hoodie-conf hoodie.metadata.enable=true \ --hoodie-conf hoodie.metadata.metrics.enable=true \ --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \ --hoodie-conf hoodie.metrics.on=true \ --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \ --hoodie-conf hoodie.parquet.block.size=268435456 \ --hoodie-conf hoodie.parquet.compression.codec=snappy \ --hoodie-conf hoodie.parquet.max.file.size=268435456 \ --hoodie-conf hoodie.parquet.small.file.limit=25000 {code} was: Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. However, I see for a certain table. Only partial discovery of files happening after the initia
[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit
[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3242: Description: Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it. However, for a certain table, only partial discovery of files is happening after the initial commit of the table. But if the second partition is given as input for the first commit, all the files are getting discovered. First partition: 2021/01 has 744 files and all of them are discovered Second partition: 2021/02 has 762 files but only 72 are discovered. {code:java} spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 30 \ --driver-memory 32g \ --executor-cores 5 \ --executor-memory 32g \ --num-executors 120 \ --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ --target-table sessions_by_date \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=1 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=false \ --hoodie-conf hoodie.clustering.inline.max.commits=1 \ --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \ --hoodie-conf 
hoodie.clustering.plan.strategy.sort.columns=sid,id \ --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \ --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ --hoodie-conf hoodie.datasource.hive_sync.enable=false \ --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \ --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \ --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \ --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ --hoodie-conf hoodie.file.listing.parallelism=256 \ --hoodie-conf 
hoodie.finalize.write.parallelism=256 \ --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \ --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \ --hoodie-conf hoodie.metadata.enable=true \ --hoodie-conf hoodie.metadata.metrics.enable=true \ --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \ --hoodie-conf hoodie.metrics.on=true \ --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \ --hoodie-conf hoodie.parquet.block.size=268435456 \ --hoodie-conf hoodie.parquet.compression.codec=snappy \ --hoodie-conf hoodie.parquet.max.file.size=268435456 \ --hoodie-conf hoodie.parquet.small.file.limit=25000 {code} was: Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it. However, I see for a certain table. Only partial discovery of files happening after the initial commit of the table
[jira] [Created] (HUDI-3242) Partial parquet file discovery after the first commit
Harsha Teja Kanna created HUDI-3242: --- Summary: Partial parquet file discovery after the first commit Key: HUDI-3242 URL: https://issues.apache.org/jira/browse/HUDI-3242 Project: Apache Hudi Issue Type: Bug Affects Versions: 0.10.1 Environment: AWS EMR 6.4.0 Spark 3.1.2 Hudi - 0.10.1-rc Reporter: Harsha Teja Kanna Attachments: Screen Shot 2022-01-13 at 2.40.55 AM-2.png Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it. However, for a certain table, only partial discovery of files is happening after the initial commit of the table. But if the second partition is given as input for the first commit, all the files are getting discovered. First partition: 2021/01 has 744 files and all of them are discovered Second partition: 2021/02 has 762 files but only 72 are discovered. {code:java} spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-cores 30 \ --driver-memory 32g \ --executor-cores 5 \ --executor-memory 32g \ --num-executors 120 \ --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \ --table-type COPY_ON_WRITE \ --source-ordering-field timestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \ --target-table sessions_by_date \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --op INSERT \ --hoodie-conf hoodie.clean.automatic=true \ --hoodie-conf hoodie.cleaner.commits.retained=1 \ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \ --hoodie-conf hoodie.clustering.inline=false \ --hoodie-conf hoodie.clustering.inline.max.commits=1 \ --hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \ --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \ --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \ --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \ --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \ --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \ --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \ --hoodie-conf hoodie.datasource.hive_sync.enable=false \ --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \ --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \ --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \ --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \ --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \ --hoodie-conf 
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \ --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM a \"" \ --hoodie-conf hoodie.file.listing.parallelism=256 \ --hoodie-conf hoodie.finalize.write.parallelism=256 \ --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \ --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \ --hoodie-conf hoodie.metadata.enable=true \ --hoodie-conf hoodie.metadata.metrics.enable=true \ --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \ --hoodie-conf hoodie.metrics.on=true \ --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \ --hoodie-conf hoodie.parquet.block.size=268435456 \ --hoodie-conf hoodie.parquet.compression.codec=snappy \ --hoodie-conf hoodie.parquet.max.file
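For reference (not part of the original report): one way to sanity-check the 744-vs-72 discrepancy above is to count the source files per partition directory and compare against what the first commit actually picked up. A minimal local sketch in Python, with a temporary directory standing in for the s3://datalake-hudi/history/... layout (the partition names and file counts below are illustrative stand-ins, not Hudi code):

```python
import tempfile
from pathlib import Path

def files_per_partition(base: Path, pattern: str = "*.parquet") -> dict:
    """Count data files under each partition directory below base."""
    counts: dict = {}
    for f in base.rglob(pattern):
        part = f.parent.relative_to(base).as_posix()
        counts[part] = counts.get(part, 0) + 1
    return counts

# Tiny local demo: two yyyy/MM partitions standing in for 2021/01 and 2021/02.
with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    for part, n in (("2021/01", 3), ("2021/02", 2)):
        (base / part).mkdir(parents=True)
        for i in range(n):
            (base / part / f"file{i}.parquet").touch()
    demo_counts = files_per_partition(base)
```

Comparing such per-partition counts against the file groups listed in the first commit's metadata would make the partial-discovery gap concrete.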
[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472466#comment-17472466 ] Harsha Teja Kanna commented on HUDI-2909: - Hi, Thanks, I will recreate the table. No problem. > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has the time-based keygen config shown below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with the exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: java.sql.Timestamp > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > 
https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
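As context for the SCALAR/MICROSECONDS config quoted above, here is a rough standalone approximation (Python, not the actual TimestampBasedAvroKeyGenerator) of what the happy path computes: interpret the scalar as epoch microseconds in GMT and format it with the output dateformat. The `%Y/%m/%d` format is an assumption, since the dateformat in the ticket text appears truncated; the 0.10.0 regression described above is that the key generator started receiving a java.sql.Timestamp instead of this scalar, which this sketch does not model.

```python
from datetime import datetime, timezone

def partition_path_from_scalar(micros: int, fmt: str = "%Y/%m/%d") -> str:
    # Approximation of timestamp.type=SCALAR with scalar.time.unit=MICROSECONDS
    # and output.timezone=GMT: treat the value as epoch microseconds and render
    # the partition path with the (assumed) year/month/day output dateformat.
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc).strftime(fmt)
```

For example, `partition_path_from_scalar(1638316800 * 1_000_000)` maps midnight UTC on 2021-12-01 to the partition path "2021/12/01".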
[jira] [Comment Edited] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering
[ https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472393#comment-17472393 ] Harsha Teja Kanna edited comment on HUDI-2943 at 1/11/22, 1:38 AM: --- Hi, Any chance this can be prioritized/fixed in 0.10.1 ? was (Author: h7kanna): Hi, Any chance this can be fixed in 0.10.1 ? > Deltastreamer fails to continue with pending clustering after restart in > 0.10.0 and inline clustering > - > > Key: HUDI-2943 > URL: https://issues.apache.org/jira/browse/HUDI-2943 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Major > Labels: core-flow-ds, sev:high > Attachments: image-2021-12-08-15-10-02-420.png > > > Deltastreamer fails to restart when there is a pending clustering commit from > a previous run with Upsert failed exception when inline clustering is on. > {*}Note{*}: workaround of running Clustering job with > --retry-last-failed-clustering-job works > Hudi version : 0.10.0 > Spark version : 3.1.2 > EMR : 6.4.0 > diagnostics: User class threw exception: > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit > time 20211206081248919 > at > org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62) > at > org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103) > at > org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306) > at > 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735) > Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not > allowed to update the clustering file group > HoodieFileGroupId\{partitionPath='', > fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering > operations, we are not going to support update for now. 
> at > org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65) > Config: > hoodie.index.type=GLOBAL_SIMPLE > hoodie.datasource.write.partitionpath.field= > hoodie.datasource.write.precombine.field=updatedate > hoodie.datasource.hive_sync.database=datalake > hoodie.datasource.write.operation=upsert > hoodie.datasource.hive_sync.table=hudi.prd.surveys > hoodie.datasource.hive_sync.mode=hms > hoodie.datasource.hive_sync.enable=false > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor > hoodie.datasource.hive_sync.use_jdbc=false > hoodie.datasource.write.recordkey.field=id > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator > hoodie.datasource.write.hive_style_partitioning=true > hoodie.finalize.write.parallelism=256 > hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16 > hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector > hoodie.parquet.max.file.size=134217728 > hoodie.parquet.small.file.limit=67108864 > hoodie.parquet.block.size=134217728 > hoodie.parquet.compression.codec=snappy > hoodie.file.listing.parallelism=256 > hoodie.upsert.shuffle.parallelism=10 > hoodie.metadata.enable=false > hoodie.metadata.clean.async=true > hoodie.clustering.preserve.commit.metadata=true > hoodie.clustering.inline.max.commits=1 > hoodie.clustering.inline=true > hoodie.clustering.plan.strategy.target.file.max.bytes=134217728 > hoodie.clustering.plan.strategy.small.file.limit=67108864 > hoodie.clusteri
[jira] [Commented] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering
[ https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472393#comment-17472393 ] Harsha Teja Kanna commented on HUDI-2943: - Hi, Any chance this can be fixed in 0.10.1 ? > Deltastreamer fails to continue with pending clustering after restart in > 0.10.0 and inline clustering > - > > Key: HUDI-2943 > URL: https://issues.apache.org/jira/browse/HUDI-2943 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Major > Labels: core-flow-ds, sev:high > Attachments: image-2021-12-08-15-10-02-420.png > > > Deltastreamer fails to restart when there is a pending clustering commit from > a previous run with Upsert failed exception when inline clustering is on. > {*}Note{*}: workaround of running Clustering job with > --retry-last-failed-clustering-job works > Hudi version : 0.10.0 > Spark version : 3.1.2 > EMR : 6.4.0 > diagnostics: User class threw exception: > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit > time 20211206081248919 > at > org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62) > at > org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103) > at > org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) 
> at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735) > Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not > allowed to update the clustering file group > HoodieFileGroupId\{partitionPath='', > fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering > operations, we are not going to support update for now. > at > org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65) > Config: > hoodie.index.type=GLOBAL_SIMPLE > hoodie.datasource.write.partitionpath.field= > hoodie.datasource.write.precombine.field=updatedate > hoodie.datasource.hive_sync.database=datalake > hoodie.datasource.write.operation=upsert > hoodie.datasource.hive_sync.table=hudi.prd.surveys > hoodie.datasource.hive_sync.mode=hms > hoodie.datasource.hive_sync.enable=false > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor > hoodie.datasource.hive_sync.use_jdbc=false > hoodie.datasource.write.recordkey.field=id > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator > hoodie.datasource.write.hive_style_partitioning=true > hoodie.finalize.write.parallelism=256 > hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16 > hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector 
> hoodie.parquet.max.file.size=134217728 > hoodie.parquet.small.file.limit=67108864 > hoodie.parquet.block.size=134217728 > hoodie.parquet.compression.codec=snappy > hoodie.file.listing.parallelism=256 > hoodie.upsert.shuffle.parallelism=10 > hoodie.metadata.enable=false > hoodie.metadata.clean.async=true > hoodie.clustering.preserve.commit.metadata=true > hoodie.clustering.inline.max.commits=1 > hoodie.clustering.inline=true > hoodie.clustering.plan.strategy.target.file.max.bytes=134217728 > hoodie.clustering.plan.strategy.small.file.limit=67108864 > hoodie.clustering.plan.strategy.sort.columns=projectid > hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.Spark
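The rejection in the stack trace above comes from SparkRejectUpdateStrategy. The gist of that check can be sketched loosely (in Python, not Hudi's actual Java; names are illustrative) as: any upsert touching a file group that belongs to a pending clustering plan fails the whole write.

```python
class HoodieClusteringUpdateException(Exception):
    """Stand-in for org.apache.hudi.exception.HoodieClusteringUpdateException."""

def validate_updates(update_file_groups, pending_clustering_file_groups):
    # Loose sketch of the reject-update strategy: if an upsert touches any
    # file group under a pending clustering plan, reject the write instead
    # of risking a conflict with the clustering rewrite.
    conflicts = set(update_file_groups) & set(pending_clustering_file_groups)
    if conflicts:
        raise HoodieClusteringUpdateException(
            f"Not allowed to update the clustering file group(s) {sorted(conflicts)}; "
            "updates are not supported while clustering is pending")
```

This is also why the --retry-last-failed-clustering-job workaround noted above unblocks the writer: completing the pending plan first empties the set of conflicting file groups.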
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 6:11 AM: -- Hi [~shivnarayan] Basic question: I am trying to use just the base-path without the wildcards, but I am facing this issue. The table effectively has two columns with the same name: the table is created using a timestamp column for key generation, mapped to a date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp. So the partition is entrydate=/mm/dd. {code:java} import org.apache.hudi.DataSourceReadOptions import org.apache.hudi.common.config.HoodieMetadataConfig val df = spark. read. format("org.apache.hudi"). option(HoodieMetadataConfig.ENABLE.key(), "true"). option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). load("s3a://datalake-hudi/sessions_by_entrydate/") df.createOrReplaceTempView("sessions") spark.sql("SELECT count(*) FROM sessions").show() {code} Without wildcards, Spark infers the column type and the query fails with {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195) at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:98) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:249) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:331) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} was (Author: h7kanna): Hi [~shivnarayan] Basic question I am trying to use just the base-path without the wildcards. But facing this issue. Table effectively has two columns with same name. table is created using a timestamp column for key generation and mapped to date partition. using hoodie.datasource.write.partitionpath.field=entrydate:timestamp So the partition is entrydate=/mm/dd. Without wildcards. spark inferring the column type and query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > --
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 4:15 AM: -- Hi [~shivnarayan] Basic question: I am trying to use just the base path without the wildcards, but I am facing this issue. The table effectively has two columns with the same name: the table is created using a timestamp column for key generation, mapped to a date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp, so the partition is entrydate=/mm/dd. Without wildcards, Spark infers the partition column type and the query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 > PM.png, metadata_files.txt, metadata_files_compacted.txt, > metadata_timeline.txt, metadata_timeline_archived.txt, > metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, > timeline.txt, writer_log.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877
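Going back to the ClassCastException quoted in the comment above: the stack trace shows the vectorized Parquet reader requesting a long from a row value that is physically a UTF8String, which is consistent with partition-column type inference over entrydate=… directories disagreeing with the table's own column type. A minimal, self-contained Java sketch of that failure mode (my own illustration with plain JDK types standing in for Spark's UTF8String and InternalRow, not Spark code):

```java
// Illustration only: the partition value read back from the directory name is
// a string, but the inferred schema asks for a long, so the unboxing cast
// throws ClassCastException (as BoxesRunTime.unboxToLong does in the trace).
public class PartitionCastSketch {
    // Stand-in for a row accessor whose column values are held as Object.
    public static long getLong(Object partitionValue) {
        return (Long) partitionValue; // fails if the value is actually a String
    }

    public static void main(String[] args) {
        Object inferredValue = "2021"; // entrydate=2021/... read back as text
        try {
            getLong(inferredValue);
            System.out.println("no error");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the reported query");
        }
    }
}
```

This is why the query only fails when the path layout lets Spark run partition discovery; with wildcards the partition directories are consumed by the glob and no conflicting column is inferred.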
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna commented on HUDI-3066: - Hi [~shivnarayan] Basic question: I am trying to use just the base path without the wildcards, but I am facing this issue. The table effectively has two columns with the same name: the table is created using a timestamp column for key generation, mapped to a date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp, so the partition is entrydate=/mm/dd. Without wildcards, Spark infers the partition column type and the query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> Very slow file listing after enabling metadata for existing tables in 0.10.0 > release
[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM: -- I am not able to determine if I fall under user type c or a/b :) from the GitHub issue or the above description. Can you please help me understand whether I have to recreate the dataset? > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has timebased keygen config shown below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: java.sql.Timestamp > at > 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
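For context on the configuration quoted above, a SCALAR/MICROSECONDS time-based key generator maps an epoch-microseconds value to a date-formatted partition path. The sketch below is my own illustration of that mapping, not the actual TimestampBasedAvroKeyGenerator code, and it assumes the (partly elided) output date format is yyyy/MM/dd in GMT; the bug reported here is that after PR #3944 the generator receives a java.sql.Timestamp instead of a scalar and rejects it:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Rough sketch of what a SCALAR/MICROSECONDS time-based key generator does
// with output.dateformat=yyyy/MM/dd and output.timezone=GMT. Not Hudi code.
public class ScalarKeygenSketch {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // Partition path for an epoch timestamp expressed in microseconds.
    public static String partitionPath(long epochMicros) {
        Instant t = Instant.ofEpochSecond(epochMicros / 1_000_000L,
                                          (epochMicros % 1_000_000L) * 1_000L);
        return FMT.format(t);
    }

    public static void main(String[] args) {
        // 2021-12-01T10:13:34.702Z, the value from the exception message
        long micros = 1_638_353_614_702_000L;
        System.out.println(partitionPath(micros)); // prints 2021/12/01
    }
}
```

The exception arises one step earlier than this sketch: when the incoming field is already a java.sql.Timestamp rather than a scalar long, the 0.10.0 generator has no branch for it and throws HoodieNotSupportedException before any formatting happens.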
[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM: -- I am not able to determine if I fall under user type c or a/b :) from the GitHub issue or the above description. Can you please help me understand whether I have to recreate the dataset?
> Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type
[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna commented on HUDI-2909: - I am not able to determine if I fall under user type c or a/b :)
> Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470802#comment-17470802 ] Harsha Teja Kanna commented on HUDI-3066: - Hi, I am working on re-testing this. I am seeing other unrelated issues with partition discovery/schema inference without a wildcard in the base path. Will post my results soon. Thanks.
> Very slow file listing after enabling metadata for existing tables in 0.10.0 > release
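A side note on reading the metadata log excerpts quoted in this issue: each metadata log-file name ends in `.log.<version>_<writeToken>`, so the version suffix gives a quick estimate of how many unmerged log files a file group has accumulated before compaction — one reason each listing has to merge so many log blocks. A small helper to pull that version out of the quoted names (my own sketch; Hudi has its own file-name utilities for this):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the log version from a Hudi log-file name such as
// ".files-_20211216144130775001.log.121_0-57-663". Illustration only.
public class LogVersionSketch {
    private static final Pattern LOG_VERSION = Pattern.compile("\\.log\\.(\\d+)_");

    public static int logVersion(String fileName) {
        Matcher m = LOG_VERSION.matcher(fileName);
        if (!m.find()) {
            throw new IllegalArgumentException("not a log file name: " + fileName);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        // Names taken verbatim from the log excerpt above.
        System.out.println(logVersion(".files-_20211216144130775001.log.121_0-57-663")); // prints 121
        System.out.println(logVersion(".files-_20211216144130775001.log.20_0-35-613"));  // prints 20
    }
}
```

Versions above 100 for the single "files" file group line up with the report that listing only became tolerable after the metadata table was compacted.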
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463577#comment-17463577 ] Harsha Teja Kanna commented on HUDI-3066: - After a few runs of the query, I see that even with a compacted metadata table, enabling metadata on the reader side is slower, though it is no longer taking a very long time (~1 hr) as before. !Screen Shot 2021-12-21 at 10.22.54 PM.png! !Screen Shot 2021-12-21 at 10.24.12 PM.png!
> Very slow file listing after enabling metadata for existing tables in 0.10.0 > release
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: Screen Shot 2021-12-21 at 10.24.12 PM.png
> Very slow file listing after enabling metadata for existing tables in 0.10.0 > release
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: Screen Shot 2021-12-21 at 10.22.54 PM.png > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: Manoj Govindassamy >Priority: Blocker > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 > PM.png, metadata_files.txt, metadata_files_compacted.txt, > metadata_timeline.txt, metadata_timeline_archived.txt, > metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, > timeline.txt, writer_log.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 > at instant 20211216183448389 > 2021-12-18 23:37:46,112 INFO log.AbstractH
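The description above enables metadata-based listing on the reader via spark.conf, and later comments compare listing times with the feature on and off. A minimal sketch for timing both modes side by side — assuming a spark-shell session (the ambient `spark` SparkSession), Hudi 0.10.x on the classpath, and that `hoodie.metadata.enable` is also honored when passed as a DataSource read option; the `timedCount` helper is hypothetical, not part of the original report:

```scala
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi" // bucket name taken from the report

// Hypothetical helper: time a snapshot read with metadata-based file
// listing toggled on or off. The measured time includes query planning
// (file listing) plus the count itself, which forces execution.
def timedCount(metadataEnabled: Boolean): Long = {
  val start = System.nanoTime()
  val df = spark.read
    .format("org.apache.hudi")
    .option(DataSourceReadOptions.QUERY_TYPE.key(),
      DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
    .option(DataSourceReadOptions.READ_PATHS.key(),
      s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
    .option(HoodieMetadataConfig.ENABLE.key(), metadataEnabled.toString)
    .load()
  df.count()
  (System.nanoTime() - start) / 1000000L // elapsed millis
}

println(s"metadata off: ${timedCount(false)} ms")
println(s"metadata on:  ${timedCount(true)} ms")
```

Running each mode more than once would separate S3 listing variance from the metadata-table overhead the comments below discuss.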
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463549#comment-17463549 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/22/21, 2:55 AM: Seems like disabling the metadata in the reader is faster than enabling it. 24s - reader metadata off; 59s - reader metadata on. This is a table of only 2182 files. Either on or off, file listing is faster than before, though. was (Author: h7kanna): Seems like disabling the metadata in the reader is faster than enabling it. 24s - reader metadata off; 59s - reader metadata on. This is a table of only 2182 files.
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463549#comment-17463549 ] Harsha Teja Kanna commented on HUDI-3066: - Seems like disabling the metadata in the reader is faster than enabling it. 24s - reader metadata off; 59s - reader metadata on. This is a table of only 2182 files.
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463539#comment-17463539 ] Harsha Teja Kanna commented on HUDI-3066: - Yes, it is a non-partitioned table. I will create a new issue. I see the file listing time varying each time I query; in one of the instances it is only 24 seconds. I am testing again repeatedly and on another bigger table. Will post the result.
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/22/21, 1:24 AM: There is no separate cleaner job. I set hoodie.clean.automatic=true. It is only one writer. I have hoodie.clean.async=true because the docs say: Only applies when hoodie.clean.automatic is turned on. When turned on, runs cleaner async with writing, which can speed up overall write performance. https://hudi.apache.org/docs/configurations#hoodiecleanasync was (Author: h7kanna): There is no separate cleaner job. I set hoodie.clean.automatic=true. It is only one writer. I have hoodie.clean.async=true because the docs say: Only applies when hoodie.clean.automatic is turned on. When turned on, runs cleaner async with writing, which can speed up overall write performance.
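For reference, the cleaner settings described in the comment above correspond to these writer properties — a sketch of the relevant fragment, with config names taken from the linked Hudi configuration page:

```properties
# Run cleaning as part of the write job (no separate cleaner job)
hoodie.clean.automatic=true

# Only applies when hoodie.clean.automatic is on; runs the cleaner
# asynchronously with writing, which can speed up overall write performance
hoodie.clean.async=true
```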
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463528#comment-17463528 ] Harsha Teja Kanna commented on HUDI-3066: - I made changes to the sync: 1) Clustering for each 10 commits 2) Removed hoodie.metadata.clean.async=true 3) Removed hoodie.clean.async=true I do see compaction kick off and file listing much faster. [^metadata_files_compacted.txt] [^metadata_timeline_compacted.txt] Though for a different table I see the below exception org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key from old file s3a://bucket/.hoodie/metadata/files/files-_0-45-650_20211221205951077001.hfile to new file s3a://bucket/.hoodie/metadata/files/files-_0-67-713_20211222000526106001.hfile with writerSchema { "type" : "record", "name" : "HoodieMetadataRecord", "namespace" : "org.apache.hudi.avro.model", "doc" : "A record saved within the Metadata Table", "fields" : [ { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null }, { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null }, { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null }, { "name" : "_hoodie_partition_path", "type" : [ "null", "string" ], "doc" : "", "default" : null }, { "name" : "_hoodie_file_name", "type" : [ "null", "string" ], "doc" : "", "default" : null }, { "name" : "key", "type" : { "type" : "string", "avro.java.string" : "String" } }, { "name" : "type", "type" : "int", "doc" : "Type of the metadata record" }, { "name" : "filesystemMetadata", "type" : [ "null", { "type" : "map", "values" : { "type" : "record", "name" : "HoodieMetadataFileInfo", "fields" : [ { "name" : "size", "type" : "long", "doc" : "Size of the file" }, { "name" : 
"isDeleted", "type" : "boolean", "doc" : "True if this file has been deleted" } ] }, "avro.java.string" : "String" } ], "doc" : "Contains information about partitions and files within the dataset" } ] } at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:102) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdateInternal(HoodieSparkCopyOnWriteTable.java:292) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdate(HoodieSparkCopyOnWriteTable.java:283) at org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:197) at org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$57154431$1(HoodieCompactor.java:133) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) at org.apache.spark.rdd.RDD.iterator(RDD.scala:335) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key from old file s3a://bucket/.hoodie/metadata/files/files-_0-45-650_20211221205951077001.hfile to new file s3a://bucket/.hoodie/metadata/files/files-_0-67-713_20211222000526106001.hfile with writerSchema { "type" : "record", "name" : "HoodieMetadataRecord", "namespace" : "org.apache.hudi.avro.model", "doc" : "A record sav
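The comment above observes that metadata-table compaction kicked in once the async clean options were removed. Assuming Hudi 0.10.x, the cadence of that compaction is governed by a writer property — a hedged sketch; the key name and its default of 10 delta commits are an assumption about this release line, not stated in the report:

```properties
# Compact the metadata table's log files into base files after this
# many delta commits (assumed default: 10 in Hudi 0.10.x)
hoodie.metadata.compact.max.delta.commits=10
```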
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: metadata_files_compacted.txt metadata_timeline_compacted.txt
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463453#comment-17463453 ] Harsha Teja Kanna commented on HUDI-3066: - rollback.requested file
{code:json}
{
  "instantToRollback": {
    "commitTime": "20211221044403331",
    "action": "deltacommit"
  },
  "RollbackRequests": [
    {
      "partitionPath": "files",
      "fileId": "",
      "latestBaseInstant": "",
      "filesToBeDeleted": [],
      "logBlocksToBeDeleted": {}
    },
    {
      "partitionPath": "files",
      "fileId": "files-",
      "latestBaseInstant": "20211218223919666001",
      "filesToBeDeleted": [],
      "logBlocksToBeDeleted": {
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.246_0-95-1905": 5936,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.144_0-63-1952": 7437,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.293_0-84-2355": 6189,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.212_0-63-1778": 6736,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.346_0-97-2576": 5971,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.11_0-27-1737": 58553,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.275_0-18-814": 5910,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.99_0-21-889": 5906,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.358_0-97-2706": 5989,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.345_0-86-2507": 7564,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.396_0-76-2741": 9532,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.308_0-63-1903": 7209,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.202_0-96-2184": 5938,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.137_0-85-2109": 6311,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.269_0-84-1927": 6853,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.201_0-85-2115": 6865,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.173_0-85-2139": 5815,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.180_0-63-1995": 7198,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.69_0-85-1993": 6903,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.53_0-84-1893": 6657,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.221_0-85-2181": 6665,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.89_0-84-1761": 6427,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.70_0-96-2065": 5933,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.325_0-84-2375": 6322,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.319_0-27-1256": 5906,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.287_0-18-832": 5906,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.365_0-84-3041": 8622,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.298_0-96-1967": 5938,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.350_0-96-3006": 5970,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.111_0-18-732": 5910,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.191_0-18-667": 5938,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.73_0-84-1896": 6345,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.60_0-63-1744": 6525,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.323_0-18-883": 5909,
        "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.389_0-84-2945": 11320,
        "s3a://bu
{code}
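The byte counts in the plan above can be aggregated to see how much metadata log data a single rollback touches. A minimal Python sketch, assuming the plan content is available as JSON as quoted above (the trimmed plan below reuses two of the entries; it is illustrative, not a Hudi API):

```python
import json

# Trimmed-down rollback plan mirroring the structure quoted above;
# only two logBlocksToBeDeleted entries are kept for brevity.
plan = json.loads("""
{
  "instantToRollback": {"commitTime": "20211221044403331", "action": "deltacommit"},
  "RollbackRequests": [
    {"partitionPath": "files", "fileId": "", "latestBaseInstant": "",
     "filesToBeDeleted": [], "logBlocksToBeDeleted": {}},
    {"partitionPath": "files", "fileId": "files-", "latestBaseInstant": "20211218223919666001",
     "filesToBeDeleted": [],
     "logBlocksToBeDeleted": {
       "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.246_0-95-1905": 5936,
       "s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.11_0-27-1737": 58553
     }}
  ]
}
""")

# Sum the log-block bytes each rollback request would delete.
for req in plan["RollbackRequests"]:
    blocks = req["logBlocksToBeDeleted"]
    total = sum(blocks.values())
    print(req["fileId"] or "(empty fileId)", "->", len(blocks), "log blocks,", total, "bytes")
```

On a real plan the same aggregation makes outliers easy to spot, such as the 58 KB .log.11 block in the list above.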
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463448#comment-17463448 ] Harsha Teja Kanna commented on HUDI-3066: - Yes, this is for testing clustering for now (pre-prod), so inline clustering is set to run on every commit.
> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_files.txt, metadata_timeline.txt, metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side (as shown below), it takes even more time per file-listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader:
>
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblo
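The repeated "Scanning log file" and "Moving to the next reader" messages above can be tallied to show how often each metadata log file is touched during listing. A small Python sketch over abbreviated copies of those log lines (the counting is illustrative; it is not a Hudi utility):

```python
import re
from collections import Counter

# Abbreviated log lines of the shape quoted above; note .log.20_0-35-613
# is scanned twice.
log_lines = [
    "INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}",
    "INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}",
    "INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}",
]

# Count how many times each metadata log file path appears.
path_re = re.compile(r"pathStr='([^']+)'")
counts = Counter(m.group(1) for line in log_lines for m in path_re.finditer(line))
for path, n in counts.most_common():
    print(n, path.rsplit("/", 1)[-1])
```

Run against the full stderr attachments, a tally like this quantifies how many re-reads of the same metadata log files the slow listing performs.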
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: metadata_files.txt
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463444#comment-17463444 ] Harsha Teja Kanna commented on HUDI-3066: - metadata files [^metadata_files.txt]
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 7:52 PM:
There is no separate cleaner job; I set hoodie.clean.automatic=true. It is only one writer. I have hoodie.clean.async=true ([https://hudi.apache.org/docs/configurations#hoodiecleanasync]) because the docs say: "Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance."

was (Author: h7kanna): There is no separate cleaner job. It is only one writer. I have hoodie.clean.async=true because the docs say: "Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance."
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442 ] Harsha Teja Kanna commented on HUDI-3066: - There is no separate cleaner job. It is only one writer. I have hoodie.clean.async=true ([https://hudi.apache.org/docs/configurations#hoodiecleanasync]) because the docs say: "Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance."
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463385#comment-17463385 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:58 PM:
Hi [~manojg] It is a single writer, with inline clustering and automatic cleaning [^writer_log.txt]
{noformat}
hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
hoodie.datasource.hive_sync.table=sessions_by_entrydate
hoodie.metadata.clean.async=true
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.hive_sync.partition_fields=entrydate
hoodie.finalize.write.parallelism=256
hoodie.clean.automatic=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
hoodie.deltastreamer.source.dfs.root=s3://bucket-archive/sessions_prd_v1/data/2021/12/21/17
hoodie.datasource.hive_sync.mode=hms
hoodie.clean.async=true
hoodie.metadata.metrics.enable=true
hoodie.parquet.max.file.size=268435456
hoodie.datasource.write.recordkey.field=id
hoodie.metadata.enable=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
hoodie.clustering.preserve.commit.metadata=true
hoodie.parquet.small.file.limit=25000
hoodie.datasource.hive_sync.enable=false
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake.prd.upsert.sessions_by_entrydate
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.commits.retained=10
hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
hoodie.clustering.inline=true
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
hoodie.clustering.plan.strategy.max.num.groups=1000
hoodie.parquet.compression.codec=snappy
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.file.listing.parallelism=256
hoodie.datasource.write.partitionpath.field=entrydate:TIMESTAMP
hoodie.parquet.block.size=268435456
hoodie.upsert.shuffle.parallelism=200
hoodie.datasource.hive_sync.ignore_exceptions=true
hoodie.clustering.plan.strategy.small.file.limit=25000
hoodie.datasource.write.precombine.field=updatedate
hoodie.clustering.plan.strategy.sort.columns=surveyid,groupid
hoodie.metrics.reporter.type=CLOUDWATCH
hoodie.metrics.on=true
hoodie.datasource.hive_sync.database=datalake-hudi
hoodie.datasource.write.operation=upsert
hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
hoodie.deltastreamer.transformer.sql=SELECT * FROM a
{noformat}
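Flat key=value config dumps like the one above are easier to audit once parsed into a map. A small Python sketch over a subset of those settings, flagging the combination that makes clustering run on every commit (the check itself is illustrative, not an official validation):

```python
# A subset of the writer configs listed above, as flat key=value lines.
raw = """\
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.commits.retained=10
hoodie.metadata.enable=true
"""

# Parse into a dict; split only on the first '=' so values may contain '='.
conf = dict(line.split("=", 1) for line in raw.splitlines() if "=" in line)

# Inline clustering with max.commits=1 schedules clustering after every
# commit -- matching the "clustering inline on every commit" comment above.
clusters_every_commit = (
    conf.get("hoodie.clustering.inline") == "true"
    and conf.get("hoodie.clustering.inline.max.commits") == "1"
)
print("clustering on every commit:", clusters_every_commit)
```

With the full config list loaded the same way, related settings (cleaner policy, retained commits, metadata flags) can be cross-checked in one place instead of scanning a run-on line.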
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463385#comment-17463385 ] Harsha Teja Kanna commented on HUDI-3066: - Hi [~manojg] It is a single writer, with inline clustering and automatic cleaning [^writer_log.txt]. Writer configuration:
hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
hoodie.datasource.hive_sync.table=sessions_by_entrydate
hoodie.metadata.clean.async=true
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.hive_sync.partition_fields=entrydate
hoodie.finalize.write.parallelism=256
hoodie.clean.automatic=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
hoodie.deltastreamer.source.dfs.root=s3://bucket-archive/sessions_prd_v1/data/2021/12/21/17
hoodie.datasource.hive_sync.mode=hms
hoodie.clean.async=true
hoodie.metadata.metrics.enable=true
hoodie.parquet.max.file.size=268435456
hoodie.datasource.write.recordkey.field=id
hoodie.metadata.enable=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
hoodie.clustering.preserve.commit.metadata=true
hoodie.parquet.small.file.limit=25000
hoodie.datasource.hive_sync.enable=false
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake.prd.upsert.sessions_by_entrydate
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.commits.retained=10
hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
hoodie.clustering.inline=true
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
hoodie.clustering.plan.strategy.max.num.groups=1000
hoodie.parquet.compression.codec=snappy
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.file.listing.parallelism=256
hoodie.datasource.write.partitionpath.field=entrydate:TIMESTAMP
hoodie.parquet.block.size=268435456
hoodie.upsert.shuffle.parallelism=200
hoodie.datasource.hive_sync.ignore_exceptions=true
hoodie.clustering.plan.strategy.small.file.limit=25000
hoodie.datasource.write.precombine.field=updatedate
hoodie.clustering.plan.strategy.sort.columns=surveyid,groupid
hoodie.metrics.reporter.type=CLOUDWATCH
hoodie.metrics.on=true
hoodie.datasource.hive_sync.database=lucid-datalake-hudi
hoodie.datasource.write.operation=upsert
hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
hoodie.deltastreamer.transformer.sql=SELECT * FROM a
> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is enabled on the Reader side (as shown below), it is taking even more time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata. There are also many log messages in AbstractHoodieLogRecordReader:
>
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordRe
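The writer settings pasted in the comment above are plain Hudi properties, so they can be kept in a single properties file and handed to the DeltaStreamer job via its --props flag instead of being passed individually. A minimal sketch, assuming a hypothetical file name and S3 location (only a few of the quoted keys are repeated here for brevity):

```properties
# sessions.properties — illustrative file holding the writer configuration quoted above;
# the file name and bucket path are assumptions, the keys/values come from the comment.
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=entrydate:TIMESTAMP
hoodie.datasource.write.precombine.field=updatedate
hoodie.datasource.write.operation=upsert
hoodie.metadata.enable=true
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=10
```

Such a file would then be referenced on the HoodieDeltaStreamer command line with something like `--props s3://your-config-bucket/sessions.properties` (bucket path illustrative).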
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: writer_log.txt
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: writer_log.txt)
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: writer_log.txt
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993 ] Harsha Teja Kanna commented on HUDI-3066: - {*}Note{*}: I ran the recent query from 'master' as I needed a fix of running clustering in parallel from master.
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:29 AM: {*}Note{*}: I ran the recent query using 'master' as I needed a fix of running clustering in parallel from master. was (Author: h7kanna): {*}Note{*}: I ran the recent query from 'master' as I needed a fix of running clustering in parallel from master.
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: Screen Shot 2021-12-20 at 10.17.44 PM-1.png)
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: stderr_part2-1.txt) > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: Manoj Govindassamy >Priority: Blocker > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, > Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, > metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2.txt, > timeline.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs suggest the delay is in view.AbstractTableFileSystemView (the resetFileGroupsReplaced function) or in metadata.HoodieBackedTableMetadata. There are also many repeated log messages from AbstractHoodieLogRecordReader:
>
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile
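For comparing slow and fast runs, the reader-side metadata lookup can be toggled with the same config key the reporter's snippet sets to "true". A minimal sketch, assuming the same Hudi 0.10.0 Spark-shell session and table layout as above (the variable name sessionsNoMdt is illustrative, not from the report):

{code:java}
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi"

// Same snapshot read as in the description, but with the metadata-table
// lookup disabled for this reader, so file listing falls back to direct
// S3 listing instead of reading the .hoodie/metadata log files.
val sessionsNoMdt = spark
  .read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .option(HoodieMetadataConfig.ENABLE.key(), "false") // reader-side metadata off
  .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
  .load()
{code}

Timing the same query against both readers isolates how much of the listing delay comes from the metadata-table code path.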
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: timeline-1.txt)
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:37 AM: Complete log files for Slow run (Metadata reader on) [^stderr_part1.txt] [^stderr_part2.txt] was (Author: h7kanna): Complete log files [^stderr_part1.txt] [^stderr_part2.txt]
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:36 AM: Complete log files [^stderr_part1.txt] [^stderr_part2.txt] was (Author: h7kanna): Complete log files [^stderr_part1.txt] [^stderr_part1.txt][^stderr_part2.txt]
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: stderr_part2-1.txt
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: stderr_part1.txt stderr_part2.txt
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna commented on HUDI-3066: - Complete log files [^stderr_part1.txt] [^stderr_part1.txt][^stderr_part2.txt]
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462981#comment-17462981 ] Harsha Teja Kanna commented on HUDI-3066: - Metadata on reader side disabled !Screen Shot 2021-12-20 at 10.17.44 PM.png! > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: Manoj Govindassamy >Priority: Major > Labels: performance, pull-request-available > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, > Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, > metadata_timeline_archived.txt, timeline-1.txt, timeline.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata.
> There are also many repeated log messages from AbstractHoodieLogRecordReader:
>
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://dat
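The comment above ("Metadata on reader side disabled") refers to turning off metadata-table-based file listing on the read path only, so the reader falls back to listing files directly from storage. A minimal sketch of that toggle, assuming the same Spark shell session, table, and paths as the snippet quoted in the issue description (the only change from that snippet is setting HoodieMetadataConfig.ENABLE to "false" as a read option; this is a config fragment, not a verified benchmark):

```scala
// Sketch: read the same table with metadata-based listing disabled,
// forcing direct file listing against S3 instead of the metadata table.
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi"  // path taken from the issue description

val sessionsNoMetadata = spark
  .read
  .format("org.apache.hudi")
  // explicitly disable the metadata table on the reader side
  .option(HoodieMetadataConfig.ENABLE.key(), "false")
  .option(DataSourceReadOptions.QUERY_TYPE.key(),
          DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .option(DataSourceReadOptions.READ_PATHS.key(),
          s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
  .load()

sessionsNoMetadata.createOrReplaceTempView("sessions_no_metadata")
```

Comparing the file-listing stage duration of this read against the metadata-enabled read is what the attached screenshots appear to capture.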
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: Screen Shot 2021-12-20 at 10.17.44 PM-1.png
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: Screen Shot 2021-12-20 at 10.17.44 PM.png
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462979#comment-17462979 ]

Harsha Teja Kanna commented on HUDI-3066:
-----------------------------------------

Metadata timeline [^metadata_timeline.txt] [^metadata_timeline_archived.txt]
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: metadata_timeline_archived.txt
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: metadata_timeline.txt