[jira] [Updated] (HUDI-4318) IndexOutOfBoundException when recordKey has List values for Bucket index table

2022-06-24 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-4318:

Description: 
Currently, the Bucket index is supported only if the record key has columns 
with simple values.

[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71]

Example record for which this breaks:

column1:value1,column2:value2,column3:[value1,value2]
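The failure can be illustrated with a minimal, hypothetical sketch of the key-splitting pattern (the class and method names below are invented for illustration and this is not the actual BucketIdentifier code): the composite record key is split on ',' into "field:value" pairs, so a List value that itself contains ',' breaks the pairing.

```java
import java.util.Arrays;

public class BucketKeySketch {

    // Hypothetical re-creation of the key-splitting pattern (NOT the actual
    // Hudi code): split the composite record key on ',' into "field:value"
    // pairs, then take the value after ':'.
    static String[] hashValues(String recordKey) {
        return Arrays.stream(recordKey.split(","))
                // For "column3:[value1,value2]" the embedded ',' splits the
                // List value in two; the fragment "value2]" contains no ':',
                // so split(":") yields a one-element array and index 1 throws
                // ArrayIndexOutOfBoundsException.
                .map(pair -> pair.split(":")[1])
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Simple values parse cleanly.
        System.out.println(Arrays.toString(
                hashValues("column1:value1,column2:value2")));
        // A List-valued column breaks the pairing.
        try {
            hashValues("column1:value1,column2:value2,column3:[value1,value2]");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("failed as described in the issue: " + e);
        }
    }
}
```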

  was:
Currently, the Bucket index is supported only if the record key has columns 
with simple values.

[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71]


> IndexOutOfBoundException when recordKey has List values for Bucket index table
> --
>
> Key: HUDI-4318
> URL: https://issues.apache.org/jira/browse/HUDI-4318
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.1
>Reporter: Harsha Teja Kanna
>Priority: Minor
>
> Currently, the Bucket index is supported only if the record key has columns 
> with simple values.
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java#L71]
> Example record for which this breaks
> column1:value1,column2:value2,column3:[value1,value2]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4318) IndexOutOfBoundException when recordKey has List values for Bucket index table

2022-06-24 Thread Harsha Teja Kanna (Jira)
Harsha Teja Kanna created HUDI-4318:
---

 Summary: IndexOutOfBoundException when recordKey has List values 
for Bucket index table
 Key: HUDI-4318
 URL: https://issues.apache.org/jira/browse/HUDI-4318
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Affects Versions: 0.11.1
Reporter: Harsha Teja Kanna







[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-02-04 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487008#comment-17487008
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

This is happening again on the same table after running sync for a while.
I will try to gather the details needed.

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>  Labels: hudi-on-call, user-support-issues
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from Spark query fails with java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
> df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184)
>   at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(:46)
>   at $anonfun$res3$1$adapted(:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.

[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-30 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484469#comment-17484469
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

Hi, thanks. By deleting the metadata and re-running the sync, I am able to load the table again.

But the corrupted metadata is gone now. I will run it on another instance of the table to provide the above info.
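The recovery step described above (dropping the metadata table so a subsequent sync rebuilds it) can be sketched as follows. This is a local-filesystem illustration with invented class and method names, not a Hudi API; for an S3 table the same deletion would be done with S3 tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class MetadataReset {

    // Hypothetical helper: recursively delete the table's .hoodie/metadata
    // directory so the next writer run can rebuild the metadata table.
    static void deleteMetadata(Path tableBasePath) throws IOException {
        Path metadata = tableBasePath.resolve(".hoodie").resolve("metadata");
        if (!Files.exists(metadata)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(metadata)) {
            // Reverse order deletes files before their parent directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in local table layout; a real table would live on S3.
        Path base = Files.createTempDirectory("hudi-table");
        Files.createDirectories(
                base.resolve(".hoodie").resolve("metadata").resolve("files"));
        deleteMetadata(base);
        System.out.println(
                Files.exists(base.resolve(".hoodie").resolve("metadata"))); // → false
    }
}
```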


[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-30 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484420#comment-17484420
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/30/22, 8:17 PM:
---

Hi, I cannot do that immediately (I will have to check); also, this is a very large table to reproduce.

But I have seen this happen on the same table after creating it for the second time.

I will try to delete the metadata folder and re-run the sync to see if that helps.

Also, I will try to see if I can reproduce this on a small table.


was (Author: h7kanna):
Hi, I cannot do that immediately(I will have to check), also this is a very 
large to reproduce.

But I have see this happen on the same table after creating it for the second 
time.

I will try to delete the metadata folder and re-run sync to see if that helps.

Also will try to see if I can reproduce this on any small table.

 

 


[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-30 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484420#comment-17484420
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

Hi, I cannot do that immediately (I will have to check); also, this is a very large table to reproduce.

But I have seen this happen on the same table after creating it for the second time.

I will try to delete the metadata folder and re-run the sync to see if that helps.

Also, I will try to see if I can reproduce this on a small table.


[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-28 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483970#comment-17483970
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

22/01/28 15:14:14 INFO HoodieFileIndex: partition path PartitionRowPath([2021/07/20],date=2021/07/20), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/29],date=2021/04/29), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/02],date=2021/02/02), total files 26
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/09/30],date=2021/09/30), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/30],date=2021/12/30), total files 16
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/10],date=2021/03/10), total files 25
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/31],date=2021/08/31), total files 19
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/01/08],date=2021/01/08), total files 14
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/19],date=2021/12/19), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/26],date=2021/04/26), total files 23
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/06],date=2021/08/06), total files 19
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/23],date=2021/05/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/09],date=2021/04/09), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/12],date=2021/10/12), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/12],date=2021/06/12), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/20],date=2021/05/20), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/04],date=2021/03/04), total files 25
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/20],date=2021/08/20), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/13],date=2021/02/13), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/08/02],date=2021/08/02), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/05/05],date=2021/05/05), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/03],date=2021/03/03), total files 24
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/05],date=2021/06/05), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/23],date=2021/10/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/09/29],date=2021/09/29), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/14],date=2021/04/14), total files 22
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/02/18],date=2021/02/18), total files 29
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/21],date=2021/03/21), total files 17
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/07/24],date=2021/07/24), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/30],date=2021/04/30), total files 17
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/01/21],date=2021/01/21), total files 15
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/29],date=2021/10/29), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/10/01],date=2021/10/01), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/06/23],date=2021/06/23), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/27],date=2021/12/27), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/04/08],date=2021/04/08), total files 21
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/14],date=2021/03/14), total files 14
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/03/26],date=2021/03/26), total files 24
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([2021/12/04],date=2021/12/04), total files 0
22/01/28 15:14:15 INFO HoodieFileIndex: partition path PartitionRowPath([202

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-28 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 8:34 AM:
---

Hi,

1) Yes, partitions are of the form '/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066

2) 'metadata list-partitions' output shown at the end

3) No delete_partition operations were performed

4) hive_sync was disabled intentionally

5) 'metadata validate-files' has been running for all partitions (389 in total) for a while now, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log files scanned => 7
1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata  - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-1_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands

[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483610#comment-17483610
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

Log



22/01/28 01:29:34 INFO Executor: Adding 
file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/hudi-utilities-bundle_2.12-0.10.1.jar
 to class loader
22/01/28 01:29:34 INFO Executor: Fetching 
spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar with 
timestamp 1643354959702
22/01/28 01:29:34 INFO Utils: Fetching 
spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar to 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/fetchFileTemp5819832321479921719.tmp
22/01/28 01:29:34 INFO Executor: Adding 
file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/org.spark-project.spark_unused-1.0.0.jar
 to class loader
22/01/28 01:29:34 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 49956.
22/01/28 01:29:34 INFO NettyBlockTransferService: Server created on 
192.168.86.5:49956
22/01/28 01:29:34 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
22/01/28 01:29:34 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.86.5:49956 with 2004.6 MiB RAM, BlockManagerId(driver, 192.168.86.5, 
49956, None)
22/01/28 01:29:34 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:35 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/Users/harshakanna/spark-warehouse/').
22/01/28 01:29:35 INFO SharedState: Warehouse path is 
'file:/Users/harshakanna/spark-warehouse/'.
22/01/28 01:29:36 INFO DataSourceUtils: Getting table path..
22/01/28 01:29:36 INFO TablePathUtils: Getting table path from path : 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Obtained hudi table path: 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:36 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Is bootstrapped table => false, tableType 
is: COPY_ON_WRITE, queryType is: snapshot
22/01/28 01:29:36 INFO DefaultSource: Loading Base File Only View with options 
:Map(hoodie.datasource.query.type -> snapshot, hoodie.metadata.enable -> true, 
path -> s3a://datalake-hudi/sessions/)
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/metadata/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type 
MERGE_ON_READ(version=1, baseFileFormat=HFILE) from 
s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableMetadataUtil: Loading latest merged file 
slices for metadata table partition files
22/01/28 01:29:38 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220126024720121__deltacommit__COMPLETED]}
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Took 2 ms to read 0 
instants, 0 replaced file groups
22/01/28 01:29:38 INFO ClusteringUtils: Found 0 files in pending clustering 
operations
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Building file system view 
for partition (files)
22/01/28 01:29:38 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=9, 
NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=0
22/01/28 01:29:38 INFO CacheConfig: Allocating LruBlockCache size=1.42 GB, 
blockSize=64 KB
22/01/28 01:29:38 INFO CacheConfig: Created c

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:30 AM:
---

Hi,

1) Yes, partitions are of type '/mm/dd'. No, using wildcards takes a long 
time. In fact, we switched to using only the base path, per the suggestion in 
the other ticket: https://issues.apache.org/jira/browse/HUDI-3066

2) 'metadata list-partitions': I am not able to run it successfully; will give 
the info once I can. It fails while parsing the SPARK_MASTER url=yarn or 
something similar.

3) No delete_partition operations were performed.

4) hive_sync is disabled intentionally.

5) 'metadata validate-files' has been running for all partitions for a while 
now, 389 in total, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log 
files scanned => 7
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - 
MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in 
bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file 
spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Opened 7 metadata log files (dataset instant=20220126024720121, metadata 
instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got 
brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - 
Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  -  
FS and metadata files count not matching for date=2022/01/15. FS files count 
19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f
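As context for point 2 in the comment above: the `metadata` CLI commands need a working Spark context, and forcing a local master before launching the CLI is a common workaround when it fails to parse SPARK_MASTER=yarn. A minimal sketch — the table path and the local master setting are assumptions for illustration, not taken from this thread:

```shell
# Assumption: hudi-cli.sh (shipped with the Hudi CLI bundle) is on PATH.
# Force a local Spark master so the CLI does not attempt to use 'yarn'.
export SPARK_MASTER="local[2]"

hudi-cli.sh <<'EOF'
connect --path s3a://datalake-hudi/sessions
metadata list-partitions
metadata validate-files
EOF
```

If validate-files then reports mismatched counts, as in the log above, the exact recovery steps depend on the Hudi version and should be checked against the docs for that release.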


[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3335:

Description: 
Have a COW table with metadata enabled. Loading it via a Spark query fails with 
java.lang.NullPointerException

*Environment*

Spark 3.1.2

Hudi 0.10.1

*Query*

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
df.createOrReplaceTempView("sessions")

*Passing an individual partition works though*

val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/date=2022/01/25")
df.createOrReplaceTempView("sessions")

*Also, disabling metadata works, but the query takes a very long time*

val df = spark.
    read.
    format("org.apache.hudi").
    option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
df.createOrReplaceTempView("sessions")

*Loading files fails with stacktrace:*

  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

*Writer config*


spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 4 \
--driver-memory 4g \
--executor-cores 4 \
--executor-memory 6g \
--num-executors 8 \
--jars 
s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3a://datalake-hudi/sessions \
--target-table sessions \
--transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--op INSERT \
--hoodie-conf hoodie.clean.automatic=true \
--hoodie-conf hoodie.cleaner.commits.retained=10 \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.clustering.inline=true \
--hoodie-conf hoodie.clustering.inline.max.commits=5 \
--hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
--hoodie-con

[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3335:

Description: 
Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException

*Environment*

Spark 3.1.2

Hudi 0.10.1

*Query*

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)

*Passing an individual partition works though*
val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/date=2022/01/25")
 df.createOrReplaceTempView(table)

*Also, disabling metadata works, but the query taking very long time*
val df = spark.
    read.
    format("org.apache.hudi").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)

*Stack trace while loading files:*

  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
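For context, here is a minimal sketch (not Hudi's or Guava's actual code; the class and method names below are illustrative) of the failure mode the top of this trace points at: `SharedInMemoryCache.putLeafFiles` hands the listing for each partition to Guava's `LocalCache.put`, which calls `Preconditions.checkNotNull` and throws `NullPointerException` if the key or value is null — so a null leaf-file listing for any one partition fails the whole load:

```python
class FileStatusCache:
    """Toy stand-in for Spark's shared file-status cache."""

    def __init__(self):
        self._cache = {}

    def put_leaf_files(self, partition_path, leaf_files):
        # Guava's LocalCache.put rejects null keys and values up front
        # via Preconditions.checkNotNull (the first frame of the trace).
        if partition_path is None or leaf_files is None:
            raise TypeError("null key or value")  # NullPointerException in the JVM
        self._cache[partition_path] = leaf_files


cache = FileStatusCache()
cache.put_leaf_files("date=2022/01/25", ["f1.parquet"])  # non-null listing: fine

try:
    # If the listing step yields a null file list for some partition,
    # the cache insert is what actually blows up.
    cache.put_leaf_files("date=2022/01/26", None)
except TypeError as e:
    print("rejected:", e)
```

This would also explain why loading a single partition succeeds while loading the whole table fails: only some partitions produce a null listing.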

*Writer config*


spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 4 \
--driver-memory 4g \
--executor-cores 4 \
--executor-memory 6g \
--num-executors 8 \
--jars 
s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3a://datalake-hudi/sessions \
--target-table sessions \
--transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--op INSERT \
--hoodie-conf hoodie.clean.automatic=true \
--hoodie-conf hoodie.cleaner.commits.retained=10 \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.clustering.inline=true \
--hoodie-conf hoodie.clustering.inline.max.commits=5 \
--hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
--hoodie-con

[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-26 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3335:

Description: 
Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException

*Environment*

Spark 3.1.2

Hudi 0.10.1

*Query*

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)

*Passing an individual partition works though*
val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/date=2022/01/25")
 df.createOrReplaceTempView(table)



*Stacktrace:*

  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

  was:
Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException

*Environment*

Spark 3.1.2

Hudi 0.10.1

*Query*

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)


*Stacktrace:*

  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apac

[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-26 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3335:

Description: 
Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException

*Environment*

Spark 3.1.2

Hudi 0.10.1

*Query*

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)


*Stacktrace:*

  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

  was:
Environment

Spark 3.1.2

Hudi 0.10.1

Query

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)

 

Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException
  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Sourc

[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-26 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3335:

Description: 
Environment

Spark 3.1.2

Hudi 0.10.1

Query

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val basePath = "s3a://datalake-hudi/v1"

 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/")
 df.createOrReplaceTempView(table)

 

Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException
  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

  was:
Environment

Spark 3.1.2

Hudi 0.10.1

Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException
  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.fo

[jira] [Created] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-26 Thread Harsha Teja Kanna (Jira)
Harsha Teja Kanna created HUDI-3335:
---

 Summary: Loading Hudi table fails with NullPointerException
 Key: HUDI-3335
 URL: https://issues.apache.org/jira/browse/HUDI-3335
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Harsha Teja Kanna


Environment

Spark 3.1.2

Hudi 0.10.1

Have a COW table with metadata enabled. Loading from Spark query fails with 
java.lang.NullPointerException
  at 
org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
  at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
  at 
org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
  at 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
  at 
org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at 
org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
  at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
  at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
  at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
  at $anonfun$res3$1(<console>:46)
  at $anonfun$res3$1$adapted(<console>:40)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3216) Support timestamp with microseconds precision

2022-01-22 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480496#comment-17480496
 ] 

Harsha Teja Kanna commented on HUDI-3216:
-

Hi team,

Can you confirm whether this will have an impact if a timestamp in microseconds 
is used as the source ordering field?
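For illustration (hypothetical values, not taken from the issue): truncating microsecond timestamps to milliseconds can turn distinct source-ordering values into ties, which is exactly the concern for a precombine field.

```python
# Two hypothetical events, 333 microseconds apart.
t1_us = 1_642_896_000_123_456  # epoch microseconds
t2_us = 1_642_896_000_123_789

assert t1_us < t2_us  # distinct (and ordered) at microsecond precision

# If ingestion keeps only millisecond precision, both map to the same
# value, so the source-ordering comparison can no longer break the tie.
t1_ms = t1_us // 1000
t2_ms = t2_us // 1000
print(t1_ms == t2_ms)  # prints "True"
```

With ties, which record wins the precombine step becomes arbitrary rather than determined by event time.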

> Support timestamp with microseconds precision
> -
>
> Key: HUDI-3216
> URL: https://issues.apache.org/jira/browse/HUDI-3216
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>
> As of now, if a field with a timestamp datatype of microsecond precision is 
> ingested into Hudi, the resultant dataset will only have millisecond 
> granularity. 
> Ref issue: [https://github.com/apache/hudi/issues/3429]
>  
> We might need to support microsecond granularity. 
> The referenced issue has some pointers on how to go about it. 
>  





[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479835#comment-17479835
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Input: monthly partitions

partition=2021/01

file1_1 - timestamp1
file2_1 - timestamp2
file3_1 - timestamp3

partition=2021/02

file1_2 - timestamp1
file2_2 - timestamp2
file3_2 - timestamp3

Now I want to run DeltaStreamer partition by partition to create the Hudi table
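A toy sketch of why this matters (illustrative file names and timestamps, not DeltaStreamer's internals): a source that only picks up files with a modification time strictly newer than the last checkpoint will miss files whose mod times are out of order across partitions — which is why being able to force checkpoint 0 on the first commit is useful here.

```python
files = [
    ("partition=2021/01/file1_1.parquet", 100),  # (path, mod time)
    ("partition=2021/02/file1_2.parquet", 100),  # parallel writers can
    ("partition=2021/01/file2_1.parquet", 300),  # interleave timestamps
    ("partition=2021/02/file2_2.parquet", 200),
]

def discover(files, checkpoint):
    """Return files strictly newer than the checkpoint timestamp."""
    return [path for path, ts in files if ts > checkpoint]

# Starting from checkpoint 0, everything is discovered:
print(len(discover(files, 0)))  # prints "4"

# But once a commit has advanced the checkpoint to 200, a file that
# lands later with an OLDER mod time (150) is never picked up:
late = files + [("partition=2021/02/file3_2.parquet", 150)]
print(len(discover(late, 200)))  # prints "1" -- file3_2 is silently skipped
```

So if the checkpoint cannot be pinned to 0, any partition whose file timestamps trail the first commit's checkpoint is only partially discovered.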

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files 
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition : 2021/01 has 744 files and all of them are discovered
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-con

[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696
 ] 

Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 9:42 PM:
---

I have an input dataset that is partitioned but consists of very small files 
(millions of them) in JSON format, and I need to create multiple Hudi tables 
out of it.
So what I do is process it with a Spark program that converts it into Parquet 
as larger files.
As the dataset is large, the partitions are processed in parallel, and the 
timestamps for the files in the result dataset can be in any order. The 
operation is only 'INSERT'.
I used to create the table by setting checkpoint 0 before, even in the 0.10.0 
release.

How can I do that now?


was (Author: h7kanna):
I have an input dataset that is partitioned but consists of very small files 
(millions of them) in JSON format, and I need to create multiple Hudi tables 
out of it.
So what I do is process it with a Spark program that converts it into Parquet 
as larger files.
As the dataset is large, the partitions are processed in parallel, and the 
timestamps for the files can be in any order. The operation is only 'INSERT'.
I used to create the table by setting checkpoint 0 before, even in the 0.10.0 
release.

How can I do that now?

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, I see that for a certain table only partial discovery of files 
> happens after the initial commit.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> Checkpoint is set to 0.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true 

[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

I have a dataset that is partitioned but consists of very small files in JSON 
format, and I need to create multiple Hudi tables out of it.
So what I do is process it with a Spark program that converts it into Parquet 
as larger files.
As the dataset is large, the partitions are processed in parallel, and the 
timestamps for the files can be in any order. The operation is only 'INSERT'.
I used to create the table by setting checkpoint 0 before, even in the 0.10.0 
release.

How can I do that now?


[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479696#comment-17479696
 ] 

Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 9:41 PM:
---

I have an input dataset that is partitioned but consists of very small files 
(millions of them) in JSON format, and I need to create multiple Hudi tables 
out of it.
So what I do is process it with a Spark program that converts it into Parquet 
as larger files.
As the dataset is large, the partitions are processed in parallel, and the 
timestamps for the files can be in any order. The operation is only 'INSERT'.
I used to create the table by setting checkpoint 0 before, even in the 0.10.0 
release.

How can I do that now?


was (Author: h7kanna):
I have a dataset that is partitioned but consists of very small files in JSON 
format, and I need to create multiple Hudi tables out of it.
So what I do is process it with a Spark program that converts it into Parquet 
as larger files.
As the dataset is large, the partitions are processed in parallel, and the 
timestamps for the files can be in any order. The operation is only 'INSERT'.
I used to create the table by setting checkpoint 0 before, even in the 0.10.0 
release.

How can I do that now?


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479685#comment-17479685
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Let me try that.

Yes, I am aware that the checkpoint need not be set for the initial creation; 
it was just my automation passing it by mistake that uncovered the issue.

Also, a few of my unloaded datasets have non-linear timestamps across 
partitions, so I create the Hudi table partition by partition and set the 
checkpoint to 0.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479650#comment-17479650
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

./20220114204813242.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220115020705625.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220115024513109.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220116191900632.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220116192332086.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120081554816.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120081925787.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120082326203.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120082717393.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120083124273.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120085228007.commit:    "deltastreamer.checkpoint.reset_key" : "0",
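The listing above comes from grepping the `.commit` files in the table's timeline. A minimal sketch of the same check in code — the `checkpoint_info` helper and the synthetic commit file are illustrative, not part of Hudi; only the `extraMetadata` field names are taken from the metadata quoted in this thread:

```python
import json
import pathlib
import tempfile

def checkpoint_info(commit_file: pathlib.Path) -> dict:
    """Return the DeltaStreamer checkpoint entries from a Hudi .commit file."""
    meta = json.loads(commit_file.read_text())
    extra = meta.get("extraMetadata", {})
    return {k: v for k, v in extra.items() if k.startswith("deltastreamer.checkpoint")}

# Demo against a synthetic commit file shaped like the metadata in this thread.
with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d) / "20220120085228007.commit"
    f.write_text(json.dumps({
        "extraMetadata": {
            "deltastreamer.checkpoint.reset_key": "0",
            "deltastreamer.checkpoint.key": "1642668697000",
        }
    }))
    info = checkpoint_info(f)

print(info)
```

Pointing the same scan at every `*.commit` file under `.hoodie/` reproduces the grep output above.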


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479644#comment-17479644
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

./20220114204813242.commit:    "deltastreamer.checkpoint.key" : "1642193264000"

./20220115020705625.commit:    "deltastreamer.checkpoint.key" : "1642204844000"

./20220115024513109.commit:    "deltastreamer.checkpoint.key" : "1642214688000"

./20220116191900632.commit:    "deltastreamer.checkpoint.key" : "1642291261000"

./20220116192332086.commit:    "deltastreamer.checkpoint.key" : "1642360982000"

./20220120081554816.commit:    "deltastreamer.checkpoint.key" : "1642377603000"

./20220120081925787.commit:    "deltastreamer.checkpoint.key" : "1642464065000"

./20220120082326203.commit:    "deltastreamer.checkpoint.key" : "1642550413000"

./20220120082717393.commit:    "deltastreamer.checkpoint.key" : "1642636804000"

./20220120083124273.commit:    "deltastreamer.checkpoint.key" : "1642667425000"

./20220120085228007.commit:    "deltastreamer.checkpoint.key" : "1642668697000"
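The checkpoint keys above look like epoch milliseconds (for the DFS-based sources this is typically derived from source-file modification times). Assuming that interpretation, a quick sketch to render them as UTC timestamps:

```python
from datetime import datetime, timezone

def checkpoint_to_utc(key: str) -> str:
    """Render an epoch-millisecond checkpoint key as a UTC timestamp string."""
    ts = int(key) / 1000  # keys are milliseconds since the epoch
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# The key recorded by the 20220120085228007 commit above.
print(checkpoint_to_utc("1642668697000"))  # -> 2022-01-20 08:51:37
```

Note that a reset value of "0" maps to 1970-01-01, i.e. "read everything from the beginning".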


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479633#comment-17479633
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

5) Checkpoint from the last commit before this replace commit:

{
  "partitionToWriteStats" : {
    "date=2022/01/19" : [ {
      "fileId" : "4c5a04e9-9288-4000-9909-4c2640c5b779-0",
      "path" : 
"date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet",
      "prevCommit" : "20220120083230209",
      "numWrites" : 297106,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 3452,
      "totalWriteBytes" : 12468290,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "date=2022/01/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 12468290,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : 
"\{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]}",
    "deltastreamer.checkpoint.reset_key" : "0",
    "deltastreamer.checkpoint.key" : "1642668697000"
  },
  "operationType" : "UPSERT",
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 6371,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "date=2022/01/19" ],
  "fileIdAndRelativePaths" : {
    "4c5a04e9-9288-4000-9909-4c2640c5b779-0" : 
"date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet"
  }
}
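For reference, the checkpoint in a commit like the one above lives under `extraMetadata`. A minimal sketch of pulling it out of the commit JSON (the two field names come from the metadata shown above; the trimmed-down payload is illustrative):

```python
import json

# Minimal sketch: extract the Deltastreamer checkpoint from a commit's
# metadata JSON. Field names come from the commit shown above; the
# trimmed-down payload here is illustrative.
commit_json = """
{
  "extraMetadata": {
    "deltastreamer.checkpoint.reset_key": "0",
    "deltastreamer.checkpoint.key": "1642668697000"
  },
  "operationType": "UPSERT"
}
"""

extra = json.loads(commit_json)["extraMetadata"]
reset_key = extra.get("deltastreamer.checkpoint.reset_key")
checkpoint = extra.get("deltastreamer.checkpoint.key")
print(reset_key, checkpoint)  # prints: 0 1642668697000
```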

!Screen Shot 2022-01-20 at 1.36.48 PM.png!

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files after the 
> initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: Screen Shot 2022-01-20 at 1.36.48 PM.png


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479631#comment-17479631
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Hi,

Initially I saw the behavior of files not being picked up from a partition, but 
that dataset was produced differently: the source dataset was unloaded from a 
warehouse, and files in different partitions do not have linear timestamps. So I 
thought that might be the cause.

Now I am seeing this in all the tables. When passing checkpoint 0 to reprocess 
a partition, Deltastreamer just skips that partition and only processes the 
current one. So I found that the checkpoint is ignored. I can see it in the 
Deltastreamer logs, and the job ends in 20 seconds.

2) No clustering commits are pending; only one Deltastreamer is running, and it 
completed successfully. I can see the cluster commit on the timeline.

3) I can reproduce it consistently, and I am not able to backfill the tables 
currently.

4) Contents of the replace commit:

{
  "partitionToWriteStats" : {
    "date=2022/01/19" : [ {
      "fileId" : "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0",
      "path" : 
"date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet",
      "prevCommit" : "null",
      "numWrites" : 297106,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 297106,
      "totalWriteBytes" : 12468914,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "date=2022/01/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 12468914,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : 
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]}"
  },
  "operationType" : "CLUSTER",
  "partitionToReplaceFileIds" : {
    "date=2022/01/19" : [ "4c5a04e9-9288-4000-9909-4c2640c5b779-0" ]
  },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 9815,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "date=2022/01/19" ],
  "fileIdAndRelativePaths" : {
    "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0" : 
"date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet"
  }
}
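Note that this replace commit's `extraMetadata` carries only the schema and no `deltastreamer.checkpoint.key`, unlike the upsert commit in (5). A hedged sketch of how a checkpoint lookup that walks the timeline newest-to-oldest would then fall back to the earlier commit (modeled on the metadata shown in this thread, not on the Hudi source):

```python
# Hedged sketch (not the actual Hudi implementation): resolving the latest
# checkpoint by scanning commits newest-to-oldest. The replace commit above
# stores no "deltastreamer.checkpoint.key" in extraMetadata, so a scan like
# this falls through to the preceding upsert commit.
def latest_checkpoint(extra_metadata_newest_first):
    for extra in extra_metadata_newest_first:
        if "deltastreamer.checkpoint.key" in extra:
            return extra["deltastreamer.checkpoint.key"]
    return None

replace_commit_extra = {"schema": "..."}  # CLUSTER commit shown above
upsert_commit_extra = {
    "deltastreamer.checkpoint.reset_key": "0",
    "deltastreamer.checkpoint.key": "1642668697000",
}
cp = latest_checkpoint([replace_commit_extra, upsert_commit_extra])
print(cp)  # prints: 1642668697000
```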


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479240#comment-17479240
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

So what I found from further debugging is that once --checkpoint 0 has been 
passed to Deltastreamer, it will not be picked up again if the value is the same.

[https://github.com/apache/hudi/blob/release-0.10.1/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L471]

I added log statements in a PR against the master branch. This is what I got:



22/01/20 04:04:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/v1/journals

22/01/20 04:04:28 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties

22/01/20 04:04:28 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/v1/journals

22/01/20 04:04:29 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220120085344674__replacecommit__COMPLETED]}

22/01/20 04:04:29 INFO DFSPathSelector: Using path selector 
org.apache.hudi.utilities.sources.helpers.DFSPathSelector

22/01/20 04:04:29 INFO HoodieDeltaStreamer: Delta Streamer running only single 
round

22/01/20 04:04:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/v1/journals

22/01/20 04:04:29 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties

22/01/20 04:04:29 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/v1/journals

22/01/20 04:04:30 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220120085344674__replacecommit__COMPLETED]}

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint reset from metadata: 0*

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint from config: 0*

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint to resume from : 
Option{val=1642668697000}*

22/01/20 04:04:30 INFO DFSPathSelector: Root path => 
s3a://datalake-hudi/v1/journals/year=2022/month=01/day=19 source limit => 
9223372036854775807

22/01/20 04:04:37 INFO DeltaSync: No new data, source checkpoint has not 
changed. Nothing to commit. Old checkpoint=(Option{val=1642668697000}). New 
Checkpoint=(1642668697000)

22/01/20 04:04:37 INFO DeltaSync: Shutting down embedded timeline server

22/01/20 04:04:37 INFO HoodieDeltaStreamer: Shut down delta streamer

22/01/20 04:04:37 INFO SparkUI: Stopped Spark web UI at http://192.168.86.5:4040

22/01/20 04:04:37 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!

22/01/20 04:04:37 INFO MemoryStore: MemoryStore cleared

22/01/20 04:04:37 INFO BlockManager: BlockManager stopped

22/01/20 04:04:38 INFO BlockManagerMaster: BlockManagerMaster stopped

22/01/20 04:04:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!

22/01/20 04:04:38 INFO SparkContext: Successfully stopped SparkContext

22/01/20 04:04:38 INFO ShutdownHookManager: Shutdown hook called

22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-acf0e21c-c48c-440c-86f8-72ff20bef349

22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-b53eb674-0c67-4b68-8974-7ff706408686

22/01/20 04:04:38 INFO MetricsSystemImpl: Stopping s3a-file-system metrics 
system...

22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system 
stopped.

22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system 
shutdown complete.
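The log above matches the following simplified model of the checkpoint resolution: the configured --checkpoint value is only honored when it differs from the reset_key already recorded in the commit metadata. This is a hypothetical sketch of the observed behavior, not the actual DeltaSync code; all names are illustrative.

```python
# Illustrative model (not the actual DeltaSync code) of the behavior seen in
# the log: "--checkpoint 0" is honored only when it differs from the
# reset_key already stored in the last commit's metadata.
def resolve_checkpoint(config_checkpoint, stored_reset_key, stored_checkpoint):
    if config_checkpoint is not None and config_checkpoint != stored_reset_key:
        # A new reset was requested: start from the configured checkpoint.
        return config_checkpoint
    # Same reset key passed again (e.g. "0" twice): resume from metadata,
    # which is why the log shows "Checkpoint to resume from : 1642668697000".
    return stored_checkpoint

# First run with --checkpoint 0: nothing stored yet, so "0" is used.
assert resolve_checkpoint("0", None, None) == "0"
# Second run with the same --checkpoint 0: the stored checkpoint wins.
assert resolve_checkpoint("0", "0", "1642668697000") == "1642668697000"
```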


[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479240#comment-17479240
 ] 

Harsha Teja Kanna edited comment on HUDI-3242 at 1/20/22, 10:23 AM:


[~shivnarayan] 

So what I found from further debugging is that once --checkpoint 0 has been 
passed to Deltastreamer, it will not be picked up again if the value is the same.

[https://github.com/apache/hudi/blob/release-0.10.1/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L471]

I added log statements in a PR against the master branch. This is what I got:

22/01/20 04:04:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/v1/journals

22/01/20 04:04:28 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties

22/01/20 04:04:28 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/v1/journals

22/01/20 04:04:29 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220120085344674__replacecommit__COMPLETED]}

22/01/20 04:04:29 INFO DFSPathSelector: Using path selector 
org.apache.hudi.utilities.sources.helpers.DFSPathSelector

22/01/20 04:04:29 INFO HoodieDeltaStreamer: Delta Streamer running only single 
round

22/01/20 04:04:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/v1/journals

22/01/20 04:04:29 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties

22/01/20 04:04:29 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/v1/journals

22/01/20 04:04:30 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220120085344674__replacecommit__COMPLETED]}

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint reset from metadata: 0*

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint from config: 0*

22/01/20 04:04:30 INFO DeltaSync: *Checkpoint to resume from : 
Option{val=1642668697000}*

22/01/20 04:04:30 INFO DFSPathSelector: Root path => 
s3a://datalake-hudi/v1/journals/year=2022/month=01/day=19 source limit => 
9223372036854775807

22/01/20 04:04:37 INFO DeltaSync: No new data, source checkpoint has not 
changed. Nothing to commit. Old checkpoint=(Option{val=1642668697000}). New 
Checkpoint=(1642668697000)

22/01/20 04:04:37 INFO DeltaSync: Shutting down embedded timeline server

22/01/20 04:04:37 INFO HoodieDeltaStreamer: Shut down delta streamer

22/01/20 04:04:37 INFO SparkUI: Stopped Spark web UI at 
http://192.168.86.5:4040

22/01/20 04:04:37 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!

22/01/20 04:04:37 INFO MemoryStore: MemoryStore cleared

22/01/20 04:04:37 INFO BlockManager: BlockManager stopped

22/01/20 04:04:38 INFO BlockManagerMaster: BlockManagerMaster stopped

22/01/20 04:04:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!

22/01/20 04:04:38 INFO SparkContext: Successfully stopped SparkContext

22/01/20 04:04:38 INFO ShutdownHookManager: Shutdown hook called

22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-acf0e21c-c48c-440c-86f8-72ff20bef349

22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-b53eb674-0c67-4b68-8974-7ff706408686

22/01/20 04:04:38 INFO MetricsSystemImpl: Stopping s3a-file-system metrics 
system...

22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system 
stopped.

22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system 
shutdown complete.



[jira] [Comment Edited] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-16 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476923#comment-17476923
 ] 

Harsha Teja Kanna edited comment on HUDI-3242 at 1/17/22, 2:30 AM:
---

I am now seeing this for all the tables: after running for a few commits, I had 
to backfill a partition, and checkpoint 0 is ignored.
The problem is that I am not able to reproduce it deterministically on a test 
table.


was (Author: h7kanna):
I am now seeing this for all the tables.. after running for few commits. at a 
partition, I had to backfill and the checkpoint 0 is ignored. 


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-16 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476923#comment-17476923
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

I am now seeing this for all the tables: after running for a few commits, I had 
to backfill a partition, and checkpoint 0 is ignored.


[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-16 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Priority: Blocker  (was: Critical)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Blocker
>  Labels: sev:critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files after the 
> initial commit of the table.
> But if the second partition is given as input for the first commit, all of its 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> The checkpoint is set to 0.
> There are no errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-14 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Priority: Critical  (was: Blocker)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files after the 
> initial commit of the table.
> But if the second partition is given as input for the first commit, all of its 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> The checkpoint is set to 0.
> There are no errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> 

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-13 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475935#comment-17475935
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 1/14/22, 5:04 AM:
---

With only the base path in the load, file listing time (a few seconds) is 
negligible compared to query runtime. Thanks.

Though I had to add a new column 'date' to every table to resolve the type 
clash in the columns above.


was (Author: h7kanna):
With only the base path in the load, file listing time (a few seconds) is 
negligible compared to query runtime. Thanks.

Though I had to add a new column 'date' to every table because of the type 
clash in the columns above.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), it takes even more 
> time per file-listing task.
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader:
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-13 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475935#comment-17475935
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

With only the base path in the load, file listing time (a few seconds) is 
negligible compared to query runtime. Thanks.

Though I had to add a new column 'date' to every table because of the type 
clash in the columns above.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), it takes even more 
> time per file-listing task.
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader:
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Rea

[jira] [Comment Edited] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode

2022-01-13 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475869#comment-17475869
 ] 

Harsha Teja Kanna edited comment on HUDI-2947 at 1/14/22, 12:17 AM:


I think this problem still exists, but not in continuous mode:

 https://issues.apache.org/jira/browse/HUDI-3242


was (Author: h7kanna):
I think this problem still exists

 https://issues.apache.org/jira/browse/HUDI-3242

> HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config 
> from CLI in continuous mode
> --
>
> Key: HUDI-2947
> URL: https://issues.apache.org/jira/browse/HUDI-2947
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> *Problem:*
> When the deltastreamer is started with a given checkpoint, e.g., `--checkpoint 
> 0`, in continuous mode, the job may pick up the wrong checkpoint later on.  
> The wrong checkpoint (for the 20211206203551080 commit) appears after the 
> replacecommit and clean: it is reset to "0" instead of "5", the value after 
> 20211206202728233.commit.  More details below.
>  
> The bug is due to the check here: 
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335]
> {code:java}
> if (cfg.checkpoint != null
>     && (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))
>         || !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)))) {
>   resumeCheckpointStr = Option.of(cfg.checkpoint);
> } {code}
> In this case of resuming after a clustering commit, "cfg.checkpoint != null" and 
> "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" 
> are both true, since "--checkpoint 0" is configured and the last commit is a 
> replacecommit without checkpoint keys.  This resets the resume checkpoint 
> string to the configured checkpoint and skips the timeline walk-back logic 
> below, which is wrong.
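
The misfire described above can be reproduced in isolation. Below is a minimal sketch, not the actual DeltaSync code: `resume`, `cliCheckpoint`, `commitMeta`, and `timelineCheckpoint` are hypothetical stand-ins for `cfg.checkpoint`, the last completed instant's metadata, and the checkpoint a timeline walk-back would find.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CheckpointGuardDemo {

    // Mirrors the guard quoted above (illustrative names, not the real API).
    static Optional<String> resume(String cliCheckpoint,
                                   Map<String, String> commitMeta,
                                   String timelineCheckpoint) {
        String resetKey = commitMeta.get("deltastreamer.checkpoint.reset_key");
        if (cliCheckpoint != null
                && (resetKey == null || resetKey.isEmpty()
                    || !cliCheckpoint.equals(resetKey))) {
            // Buggy branch: the CLI checkpoint wins and the timeline
            // walk-back is skipped entirely.
            return Optional.of(cliCheckpoint);
        }
        return Optional.ofNullable(timelineCheckpoint);
    }

    public static void main(String[] args) {
        // The last instant is a replacecommit carrying no checkpoint keys,
        // so the guard fires and "0" is returned even though the timeline
        // walk-back would have produced "5".
        System.out.println(resume("0", new HashMap<>(), "5").get());
    }
}
```

Running `main` prints `0`, matching the report: the configured `--checkpoint 0` overrides the "5" that the timeline would supply after the replacecommit.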
>  
> Timeline:
>  
> {code:java}
>  189069 Dec  6 12:19 20211206201238649.commit
>       0 Dec  6 12:12 20211206201238649.commit.requested
>       0 Dec  6 12:12 20211206201238649.inflight
>  189069 Dec  6 12:27 20211206201959151.commit
>       0 Dec  6 12:20 20211206201959151.commit.requested
>       0 Dec  6 12:20 20211206201959151.inflight
>  189069 Dec  6 12:34 20211206202728233.commit
>       0 Dec  6 12:27 20211206202728233.commit.requested
>       0 Dec  6 12:27 20211206202728233.inflight
>   36662 Dec  6 12:35 20211206203449899.replacecommit
>       0 Dec  6 12:35 20211206203449899.replacecommit.inflight
>   34656 Dec  6 12:35 20211206203449899.replacecommit.requested
>   28013 Dec  6 12:35 20211206203503574.clean
>   19024 Dec  6 12:35 20211206203503574.clean.inflight
>   19024 Dec  6 12:35 20211206203503574.clean.requested
>  189069 Dec  6 12:43 20211206203551080.commit
>       0 Dec  6 12:35 20211206203551080.commit.requested
>       0 Dec  6 12:35 20211206203551080.inflight
>  189069 Dec  6 12:50 20211206204311612.commit
>       0 Dec  6 12:43 20211206204311612.commit.requested
>       0 Dec  6 12:43 20211206204311612.inflight
>       0 Dec  6 12:50 20211206205044595.commit.requested
>       0 Dec  6 12:50 20211206205044595.inflight
>     128 Dec  6 12:56 archived
>     483 Dec  6 11:52 hoodie.properties
>  {code}
>  
> Checkpoints in commits:
>  
> {code:java}
> grep "deltastreamer.checkpoint.key" *
> 20211206201238649.commit:    "deltastreamer.checkpoint.key" : "2"
> 20211206201959151.commit:    "deltastreamer.checkpoint.key" : "3"
> 20211206202728233.commit:    "deltastreamer.checkpoint.key" : "4"
> 20211206203551080.commit:    "deltastreamer.checkpoint.key" : "1"
> 20211206204311612.commit:    "deltastreamer.checkpoint.key" : "2" {code}
>  
> *Steps to reproduce:*
> Run HoodieDeltaStreamer in the continuous mode, by providing both 
> "--checkpoint 0" and "--continuous", with inline clustering and sync clean 
> enabled (some configs are masked).
>  
> {code:java}
> spark-submit \
>   --master yarn \
>   --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores 
> 4 \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --conf spark.speculation=true \
>   --conf spark.speculation.multiplier=1.0 \
>   --conf spark.speculation.quantile=0.5 \
>   --packages org.apache.spark:spark-avro_2.12:3.2.0 \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   file:/h

[jira] [Commented] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode

2022-01-13 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475869#comment-17475869
 ] 

Harsha Teja Kanna commented on HUDI-2947:
-

I think this problem still exists

 https://issues.apache.org/jira/browse/HUDI-3242

> HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config 
> from CLI in continuous mode
> --
>
> Key: HUDI-2947
> URL: https://issues.apache.org/jira/browse/HUDI-2947
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> *Problem:*
> When the deltastreamer is started with a given checkpoint, e.g., `--checkpoint 
> 0`, in continuous mode, the job may pick up the wrong checkpoint later on.  
> The wrong checkpoint (for the 20211206203551080 commit) appears after the 
> replacecommit and clean: it is reset to "0" instead of "5", the value after 
> 20211206202728233.commit.  More details below.
>  
> The bug is due to the check here: 
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335]
> {code:java}
> if (cfg.checkpoint != null
>     && (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))
>         || !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)))) {
>   resumeCheckpointStr = Option.of(cfg.checkpoint);
> } {code}
> In this case of resuming after a clustering commit, "cfg.checkpoint != null" and 
> "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" 
> are both true, since "--checkpoint 0" is configured and the last commit is a 
> replacecommit without checkpoint keys.  This resets the resume checkpoint 
> string to the configured checkpoint and skips the timeline walk-back logic 
> below, which is wrong.
>  
> Timeline:
>  
> {code:java}
>  189069 Dec  6 12:19 20211206201238649.commit
>       0 Dec  6 12:12 20211206201238649.commit.requested
>       0 Dec  6 12:12 20211206201238649.inflight
>  189069 Dec  6 12:27 20211206201959151.commit
>       0 Dec  6 12:20 20211206201959151.commit.requested
>       0 Dec  6 12:20 20211206201959151.inflight
>  189069 Dec  6 12:34 20211206202728233.commit
>       0 Dec  6 12:27 20211206202728233.commit.requested
>       0 Dec  6 12:27 20211206202728233.inflight
>   36662 Dec  6 12:35 20211206203449899.replacecommit
>       0 Dec  6 12:35 20211206203449899.replacecommit.inflight
>   34656 Dec  6 12:35 20211206203449899.replacecommit.requested
>   28013 Dec  6 12:35 20211206203503574.clean
>   19024 Dec  6 12:35 20211206203503574.clean.inflight
>   19024 Dec  6 12:35 20211206203503574.clean.requested
>  189069 Dec  6 12:43 20211206203551080.commit
>       0 Dec  6 12:35 20211206203551080.commit.requested
>       0 Dec  6 12:35 20211206203551080.inflight
>  189069 Dec  6 12:50 20211206204311612.commit
>       0 Dec  6 12:43 20211206204311612.commit.requested
>       0 Dec  6 12:43 20211206204311612.inflight
>       0 Dec  6 12:50 20211206205044595.commit.requested
>       0 Dec  6 12:50 20211206205044595.inflight
>     128 Dec  6 12:56 archived
>     483 Dec  6 11:52 hoodie.properties
>  {code}
>  
> Checkpoints in commits:
>  
> {code:java}
> grep "deltastreamer.checkpoint.key" *
> 20211206201238649.commit:    "deltastreamer.checkpoint.key" : "2"
> 20211206201959151.commit:    "deltastreamer.checkpoint.key" : "3"
> 20211206202728233.commit:    "deltastreamer.checkpoint.key" : "4"
> 20211206203551080.commit:    "deltastreamer.checkpoint.key" : "1"
> 20211206204311612.commit:    "deltastreamer.checkpoint.key" : "2" {code}
>  
> *Steps to reproduce:*
> Run HoodieDeltaStreamer in the continuous mode, by providing both 
> "--checkpoint 0" and "--continuous", with inline clustering and sync clean 
> enabled (some configs are masked).
>  
> {code:java}
> spark-submit \
>   --master yarn \
>   --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores 
> 4 \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --conf spark.speculation=true \
>   --conf spark.speculation.multiplier=1.0 \
>   --conf spark.speculation.quantile=0.5 \
>   --packages org.apache.spark:spark-avro_2.12:3.2.0 \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   file:/home/hadoop/ethan/hudi-utilities-bundle_2.12-0.10.0-rc3.jar \
>   --props file:/home/hadoop/ethan/test.properties \
>   --source-class ... \
>   --source-ordering-field ts \
>   --target-bas

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Description: 
Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.

However, for a certain table I see only partial discovery of files after the 
initial commit of the table.

But if the second partition is given as input for the first commit, all of its 
files are discovered.

First partition: 2021/01 has 744 files, and all of them are discovered.
Second partition: 2021/02 has 762 files, but only 72 are discovered.

The checkpoint is set to 0.

There are no errors in the logs.
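
One mechanism that can produce this kind of partial pickup with a modification-time-based source selector is sketched below. This is purely illustrative: it is not Hudi's actual DFSPathSelector implementation, the cause of this report is not confirmed, and `FileEntry`, `selectNewFiles`, and the batch cap are hypothetical names introduced for the sketch.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ModTimeSelectorDemo {
    // Illustrative file entry: path plus modification time (epoch millis).
    record FileEntry(String path, long modTime) {}

    // A mod-time-keyed selection pass: take files newer than the checkpoint,
    // cap the batch, and emit the max selected mod-time as the next checkpoint.
    static Map.Entry<List<FileEntry>, Long> selectNewFiles(
            List<FileEntry> files, long checkpoint, int maxFiles) {
        List<FileEntry> picked = files.stream()
                .filter(f -> f.modTime() > checkpoint)
                .sorted(Comparator.comparingLong(FileEntry::modTime))
                .limit(maxFiles)
                .collect(Collectors.toList());
        long next = picked.isEmpty() ? checkpoint
                : picked.get(picked.size() - 1).modTime();
        return Map.entry(picked, next);
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
                new FileEntry("2021/02/a.parquet", 100),
                new FileEntry("2021/02/b.parquet", 200),
                new FileEntry("2021/02/c.parquet", 300));
        // With checkpoint 0 but a capped batch, only part of the partition
        // is selected in the first round; the rest waits for a later run.
        var result = selectNewFiles(files, 0L, 2);
        System.out.println(result.getKey().size() + " files, next checkpoint "
                + result.getValue());
    }
}
```

Under such a scheme, a first commit that picks up only 72 of 762 files would simply mean the batch was capped, with subsequent rounds expected to advance the checkpoint and collect the remainder.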
{code:java}
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 30 \
--driver-memory 32g \
--executor-cores 5 \
--executor-memory 32g \
--num-executors 120 \
--jars 
s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
 \
--table-type COPY_ON_WRITE \
--source-ordering-field timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
--target-table sessions_by_date \
--transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--op INSERT \
--checkpoint 0 \
--hoodie-conf hoodie.clean.automatic=true \
--hoodie-conf hoodie.cleaner.commits.retained=1 \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.clustering.inline=false \
--hoodie-conf hoodie.clustering.inline.max.commits=1 \
--hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
--hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
--hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
--hoodie-conf hoodie.datasource.hive_sync.enable=false \
--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
 \
--hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
--hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
--hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
 \
--hoodie-conf hoodie.datasource.write.operation=insert \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
--hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
--hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
--hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
 \
--hoodie-conf 
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
 \
--hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), 
'yyyy/MM/dd') AS date FROM  a \"" \
--hoodie-conf hoodie.file.listing.parallelism=256 \
--hoodie-conf hoodie.finalize.write.parallelism=256 \
--hoodie-conf 
hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
--hoodie-conf hoodie.insert.shuffle.parallelism=1000 \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.metadata.metrics.enable=true \
--hoodie-conf 
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date
 \
--hoodie-conf hoodie.metrics.on=true \
--hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \
--hoodie-conf hoodie.parquet.block.size=268435456 \
--hoodie-conf hoodie.parquet.compression.codec=snappy \
--hoodie-conf hoodie.parquet.max.file.size=268435456 \
--hoodie-conf hoodie.parquet.small.file.limit=25000 {code}
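Since no errors appear in the logs, one plausible explanation for 762 files yielding only 72 is the source path selector's per-run read budget rather than the checkpoint value itself: files newer than the checkpoint are picked in modification-time order until a size limit is reached, and the checkpoint then advances. The sketch below is illustrative only (hypothetical function and parameter names; the real DFSPathSelector behavior may differ):

```python
# Hypothetical sketch of a DFS path selector's file-picking logic.
# Not the actual Hudi API; names and the budget rule are assumptions.
def select_new_files(files, last_checkpoint_ms, source_limit_bytes):
    """files: iterable of (path, mod_time_ms, size_bytes) tuples."""
    # Keep only files modified strictly after the last checkpoint,
    # oldest first.
    candidates = sorted(
        (f for f in files if f[1] > last_checkpoint_ms),
        key=lambda f: f[1],
    )
    picked, total = [], 0
    for path, mod_time_ms, size_bytes in candidates:
        if total + size_bytes > source_limit_bytes:
            break  # read budget exhausted; remaining files wait for later runs
        picked.append(path)
        total += size_bytes
    # Checkpoint advances to the newest picked file's modification time.
    new_checkpoint = candidates[len(picked) - 1][1] if picked else last_checkpoint_ms
    return picked, new_checkpoint
```

If something like this is happening, subsequent runs should pick up the remaining files from the advanced checkpoint; files being skipped while the checkpoint stays at 0 would instead point at the `--checkpoint 0` argument being ignored, as the issue title suggests.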
 

 

 

  was:
Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it.

However, I see for a certain table. Only partial dis

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Priority: Blocker  (was: Critical)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Blocker
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> -

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Summary: Checkpoint 0 is ignored -Partial parquet file discovery after the 
first commit  (was: Partial parquet file discovery after the first commit)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Description: 
Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.

However, for a certain table, only partial discovery of files happens after 
the initial commit of the table.

But if the second partition is given as input for the first commit, all the 
files are discovered.

First partition: 2021/01 has 744 files, and all of them are discovered.
Second partition: 2021/02 has 762 files, but only 72 are discovered.

No errors in the logs.
{code:java}
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 30 \
--driver-memory 32g \
--executor-cores 5 \
--executor-memory 32g \
--num-executors 120 \
--jars 
s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
 \
--table-type COPY_ON_WRITE \
--source-ordering-field timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
--target-table sessions_by_date \
--transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--op INSERT \
--checkpoint 0 \
--hoodie-conf hoodie.clean.automatic=true \
--hoodie-conf hoodie.cleaner.commits.retained=1 \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.clustering.inline=false \
--hoodie-conf hoodie.clustering.inline.max.commits=1 \
--hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
--hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
--hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
--hoodie-conf hoodie.datasource.hive_sync.enable=false \
--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
 \
--hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
--hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
--hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
 \
--hoodie-conf hoodie.datasource.write.operation=insert \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
--hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
--hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
--hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
 \
--hoodie-conf 
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
 \
--hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), 
'yyyy/MM/dd') AS date FROM  a \"" \
--hoodie-conf hoodie.file.listing.parallelism=256 \
--hoodie-conf hoodie.finalize.write.parallelism=256 \
--hoodie-conf 
hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
--hoodie-conf hoodie.insert.shuffle.parallelism=1000 \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.metadata.metrics.enable=true \
--hoodie-conf 
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date
 \
--hoodie-conf hoodie.metrics.on=true \
--hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \
--hoodie-conf hoodie.parquet.block.size=268435456 \
--hoodie-conf hoodie.parquet.compression.codec=snappy \
--hoodie-conf hoodie.parquet.max.file.size=268435456 \
--hoodie-conf hoodie.parquet.small.file.limit=25000 {code}
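Given the key-generator settings above (partition field `date:TIMESTAMP`, input/output format `yyyy/MM/dd`, GMT timezones, hive-style partitioning), the partition path for each record is derived from the `date` column produced by the SQL transformer. A minimal sketch of that mapping, with a hypothetical function name (not the actual Hudi CustomKeyGenerator implementation):

```python
from datetime import datetime, timezone

# Illustrative sketch of mapping a TIMESTAMP-typed partition field to a
# partition path under the configs above. Names are hypothetical.
def partition_path(date_value: str, hive_style: bool = True) -> str:
    # Parse with the configured input format (Java pattern "yyyy/MM/dd").
    dt = datetime.strptime(date_value, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    # Re-emit with the configured output format.
    out = dt.strftime("%Y/%m/%d")
    # hoodie.datasource.write.hive_style_partitioning=true prefixes the
    # field name, i.e. "date=2021/02/05" instead of "2021/02/05".
    return f"date={out}" if hive_style else out

print(partition_path("2021/02/05"))  # date=2021/02/05
```

This is why partitions land as `2021/01`, `2021/02`, etc.: the transformer's `date_format(..., 'yyyy/MM/dd')` output is passed through the key generator unchanged apart from the optional hive-style prefix.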
 

 

 

  was:
Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it.

However, I see for a certain table. Only partial discovery of files happening 

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: (was: Screen Shot 2022-01-13 at 2.40.55 AM.png)

> Partial parquet file discovery after the first commit
> -
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf 
> hoodie.generate.consistent.timestamp.logical.for.k

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: Screen Shot 2022-01-13 at 2.40.55 AM.png
Screen Shot 2022-01-13 at 2.55.35 AM.png

> Partial parquet file discovery after the first commit
> -
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf 
> hood

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: Screen Shot 2022-01-13 at 2.55.35 AM.png

> Partial parquet file discovery after the first commit
> -
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf 
> hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
> --hoodie-conf hoodie.insert.shuffl

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: (was: Screen Shot 2022-01-13 at 2.40.55 AM-2.png)

> Partial parquet file discovery after the first commit
> -
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png
>
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, only partial discovery of files happens after 
> the initial commit of the table.
> But if the second partition is given as input for the first commit, all of 
> its files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
> to_timestamp(timestamp) as timestamp, sid, 
> date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf 
> hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
> --hoodie-conf hoodie.

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: Screen Shot 2022-01-13 at 2.40.55 AM.png


[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Attachment: (was: Screen Shot 2022-01-13 at 2.55.35 AM.png)


[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Description: 
Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.

However, for a certain table, only partial discovery of files happens after 
the initial commit of the table.

But if the second partition is given as input for the first commit, all of 
its files are discovered.

First partition: 2021/01 has 744 files and all of them are discovered.
Second partition: 2021/02 has 762 files but only 72 are discovered.

No errors in the logs.
{code:java}
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-cores 30 \
--driver-memory 32g \
--executor-cores 5 \
--executor-memory 32g \
--num-executors 120 \
--jars 
s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
 \
--table-type COPY_ON_WRITE \
--source-ordering-field timestamp \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
--target-table sessions_by_date \
--transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--op INSERT \
--hoodie-conf hoodie.clean.automatic=true \
--hoodie-conf hoodie.cleaner.commits.retained=1 \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.clustering.inline=false \
--hoodie-conf hoodie.clustering.inline.max.commits=1 \
--hoodie-conf 
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
 \
--hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
--hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
--hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
--hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
--hoodie-conf hoodie.datasource.hive_sync.enable=false \
--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
 \
--hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
--hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
--hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
 \
--hoodie-conf hoodie.datasource.write.operation=insert \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
--hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
--hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
--hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
 \
--hoodie-conf 
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
 \
--hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, 
to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), 
'yyyy/MM/dd') AS date FROM  a \"" \
--hoodie-conf hoodie.file.listing.parallelism=256 \
--hoodie-conf hoodie.finalize.write.parallelism=256 \
--hoodie-conf 
hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
--hoodie-conf hoodie.insert.shuffle.parallelism=1000 \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.metadata.metrics.enable=true \
--hoodie-conf 
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date
 \
--hoodie-conf hoodie.metrics.on=true \
--hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \
--hoodie-conf hoodie.parquet.block.size=268435456 \
--hoodie-conf hoodie.parquet.compression.codec=snappy \
--hoodie-conf hoodie.parquet.max.file.size=268435456 \
--hoodie-conf hoodie.parquet.small.file.limit=25000 {code}
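The key-generator settings above can be illustrated with a short sketch. This is a simplified model for illustration only, not Hudi's actual Java implementation; the function name `partition_path` is hypothetical. With `timestamp.type=DATE_STRING`, the partition field produced by the transformer SQL is parsed with the input date format and re-emitted with the output date format (both GMT here):

```python
from datetime import datetime, timezone

# Simplified model of a DATE_STRING time-based key generator:
# parse the partition field with the input format, then re-emit it
# with the output format (input/output timezone both GMT here).
IN_FMT = "%Y/%m/%d"   # hoodie...keygen.timebased.input.dateformat=yyyy/MM/dd
OUT_FMT = "%Y/%m/%d"  # hoodie...keygen.timebased.output.dateformat=yyyy/MM/dd

def partition_path(date_field: str) -> str:
    dt = datetime.strptime(date_field, IN_FMT).replace(tzinfo=timezone.utc)
    return dt.strftime(OUT_FMT)

# The transformer SQL emits date_format(to_timestamp(timestamp), 'yyyy/MM/dd'),
# so a record lands in a partition path like:
print(partition_path("2021/02/15"))  # 2021/02/15
```

With input and output formats identical, the generator effectively passes the transformer's `date` column through unchanged as the partition path.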
 

 

 

  was:
Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it.

However, I see for a certain table. Only partial discovery of files happening 
after the initia

[jira] [Updated] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Description: 
Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.

However, for a certain table, only partial discovery of files happens after 
the initial commit of the table.

But if the second partition is given as input for the first commit, all of 
its files are discovered.

First partition: 2021/01 has 744 files and all of them are discovered.
Second partition: 2021/02 has 762 files but only 72 are discovered.




  was:
Hi, I am testing release branch 0.10.1 as I needed few bug fixes from it.

However, I see for a certain table. Only partial discovery of files happening 
after the initial commit of the table

[jira] [Created] (HUDI-3242) Partial parquet file discovery after the first commit

2022-01-13 Thread Harsha Teja Kanna (Jira)
Harsha Teja Kanna created HUDI-3242:
---

 Summary: Partial parquet file discovery after the first commit
 Key: HUDI-3242
 URL: https://issues.apache.org/jira/browse/HUDI-3242
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 0.10.1
 Environment: AWS

EMR 6.4.0
Spark 3.1.2
Hudi - 0.10.1-rc

Reporter: Harsha Teja Kanna
 Attachments: Screen Shot 2022-01-13 at 2.40.55 AM-2.png

Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.

However, for a certain table, only partial discovery of files happens after 
the initial commit of the table.

But if the partition is given as input for the first commit, all the files 
are discovered.

First partition: 2021/01 has 744 files and all of them are discovered.
Second partition: 2021/02 has 762 files but only 72 are discovered.

[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-10 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472466#comment-17472466
 ] 

Harsha Teja Kanna commented on HUDI-2909:
-

Hi, Thanks,

I will recreate the table. No problem.

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has the time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  
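For context, here is a minimal self-contained sketch (hypothetical helper, not Hudi's actual TimestampBasedAvroKeyGenerator) of what the quoted SCALAR/MICROSECONDS keygen settings are expected to do: treat the field as microseconds since the epoch and format it as a GMT date partition path. The failure in this issue arises when the field instead arrives as a java.sql.Timestamp, which the key generator does not expect.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical sketch: a SCALAR timestamp in MICROSECONDS, formatted to a
// GMT "yyyy/MM/dd" partition path, mirroring the quoted keygen config.
object ScalarPartitionPath {
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC)

  def partitionPath(scalarMicros: Long): String =
    fmt.format(Instant.EPOCH.plus(scalarMicros, ChronoUnit.MICROS))
}
```

For example, `ScalarPartitionPath.partitionPath(1638353614702000L)` corresponds to the `2021-12-01 10:13:34.702` value from the stack trace and yields the partition path `2021/12/01`.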



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering

2022-01-10 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472393#comment-17472393
 ] 

Harsha Teja Kanna edited comment on HUDI-2943 at 1/11/22, 1:38 AM:
---

Hi,
Any chance this can be prioritized/fixed in 0.10.1 ?


was (Author: h7kanna):
Hi,
Any chance this can be fixed in 0.10.1 ?

> Deltastreamer fails to continue with pending clustering after restart in 
> 0.10.0 and inline clustering
> -
>
> Key: HUDI-2943
> URL: https://issues.apache.org/jira/browse/HUDI-2943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, sev:high
> Attachments: image-2021-12-08-15-10-02-420.png
>
>
> Deltastreamer fails to restart with an Upsert failed exception when there is 
> a pending clustering commit from a previous run and inline clustering is on.
> {*}Note{*}: the workaround of running the Clustering job with 
> --retry-last-failed-clustering-job works
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
> diagnostics: User class threw exception: 
> org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
> time 20211206081248919
> at 
> org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
> at 
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
> at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not 
> allowed to update the clustering file group 
> HoodieFileGroupId\{partitionPath='', 
> fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering 
> operations, we are not going to support update for now.
> at 
> org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clusteri
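The HoodieClusteringUpdateException above comes from a reject-update strategy. As a rough illustration (hypothetical types, not Hudi's SparkRejectUpdateStrategy API), the check amounts to intersecting the file groups touched by the upsert with those frozen by a pending clustering plan:

```scala
// Hypothetical sketch of a reject-update strategy: an upsert that touches a
// file group referenced by a pending clustering plan fails fast, which is
// why the restarted Deltastreamer run aborts instead of writing.
case class FileGroupId(partitionPath: String, fileId: String)

final class RejectUpdateStrategy(pendingClustering: Set[FileGroupId]) {
  /** Throws if any updated file group is part of a pending clustering plan. */
  def validate(updated: Set[FileGroupId]): Unit = {
    val conflicts = updated.intersect(pendingClustering)
    if (conflicts.nonEmpty)
      throw new IllegalStateException(
        s"Not allowed to update the clustering file groups: $conflicts")
  }
}
```

With inline clustering and upserts on the same file groups, this check is what the restarted job trips over until the pending plan is completed (e.g. via --retry-last-failed-clustering-job).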

[jira] [Commented] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering

2022-01-10 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472393#comment-17472393
 ] 

Harsha Teja Kanna commented on HUDI-2943:
-

Hi,
Any chance this can be fixed in 0.10.1 ?

> Deltastreamer fails to continue with pending clustering after restart in 
> 0.10.0 and inline clustering
> -
>
> Key: HUDI-2943
> URL: https://issues.apache.org/jira/browse/HUDI-2943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, sev:high
> Attachments: image-2021-12-08-15-10-02-420.png
>
>
> Deltastreamer fails to restart with an Upsert failed exception when there is 
> a pending clustering commit from a previous run and inline clustering is on.
> {*}Note{*}: the workaround of running the Clustering job with 
> --retry-last-failed-clustering-job works
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
> diagnostics: User class threw exception: 
> org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
> time 20211206081248919
> at 
> org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
> at 
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
> at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not 
> allowed to update the clustering file group 
> HoodieFileGroupId\{partitionPath='', 
> fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering 
> operations, we are not going to support update for now.
> at 
> org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clustering.plan.strategy.sort.columns=projectid
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.Spark

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 6:11 AM:
--

Hi [~shivnarayan], a basic question:

I am trying to use just the base path without wildcards, but I am facing this 
issue.

The table effectively has two columns with the same name.

The table is created using a timestamp column for key generation, mapped to a 
date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=/mm/dd. 


{code:java}
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load("s3a://datalake-hudi/sessions_by_entrydate/")

df.createOrReplaceTempView("sessions")

spark.sql("SELECT count(*) FROM sessions").show() {code}
Without wildcards, Spark infers the column type and the query fails with:
{code:java}
Caused by: java.lang.ClassCastException: 
org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
  at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
  at 
org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:98)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:230)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:249)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:331)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
  at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
 {code}
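A plain-Scala sketch (no Spark required; hypothetical names) of the type clash behind that ClassCastException: the Parquet file schema carries the column as a long-backed timestamp, while partition-path discovery infers the directory value as a string, and the two collide when the vectorized reader populates partition columns. Reconciling the schemas up front surfaces the conflict explicitly:

```scala
// Hypothetical sketch of conflicting type inference for a partition column:
// the file schema says "long" (a timestamp), partition discovery says
// "string" (the directory value). Detecting the mismatch early avoids a
// runtime UTF8String -> Long cast failure.
object SchemaReconcile {
  sealed trait ColType
  case object LongCol extends ColType
  case object StringCol extends ColType

  def reconcile(fileSchema: Map[String, ColType],
                pathInferred: Map[String, ColType]): Either[String, Map[String, ColType]] = {
    val conflicts = fileSchema.toSeq.flatMap { case (name, t1) =>
      pathInferred.get(name).filter(_ != t1).map(t2 => s"$name: file=$t1, path=$t2")
    }
    if (conflicts.isEmpty) Right(fileSchema ++ pathInferred)
    else Left(conflicts.mkString("; "))
  }
}
```

In this issue the conflicting column is the timestamp-derived partition field, which only surfaces when reading from the bare base path (wildcard paths bypass partition discovery).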


was (Author: h7kanna):
Hi [~shivnarayan] Basic question

I am trying to use just the base-path without the wildcards. But facing this 
issue.

Table effectively has two columns with same name.

table is created using a timestamp column for key generation and mapped to date 
partition.
using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=/mm/dd. 

Without wildcards. spark inferring the column type and query fails with 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> --

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 4:15 AM:
--

Hi [~shivnarayan], a basic question:

I am trying to use just the base path without wildcards, but I am facing this 
issue.

The table effectively has two columns with the same name.

The table is created using a timestamp column for key generation, mapped to a 
date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=/mm/dd. 

Without wildcards, Spark infers the column type and the query fails with:

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 


was (Author: h7kanna):
Hi [~shivnarayan] Basic question

I am trying to use just the base-path without the wildcards. But facing this 
issue.

Table effectively has two columns with same name.



table is created using a timestamp column for key generation and mapped to date 
partition.
using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=/mm/dd. 

Without wildcards. spark inferring the column type is query fails with 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), it takes even more 
> time per file-listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Hi [~shivnarayan], a basic question:

I am trying to use just the base path without wildcards, but I am facing this 
issue.

The table effectively has two columns with the same name.

The table is created using a timestamp column for key generation, mapped to a 
date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=/mm/dd. 

Without wildcards, Spark infers the column type and the query fails with:

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), it takes even more 
> time per file-listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream

[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM:
--

I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

Can you please help me understand if I have to recreate the dataset?


was (Author: h7kanna):
I am not able to determine if I fall under user type c or a/b :)

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  





[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM:
--

I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

Can you please help me understand if I have to recreate the dataset?


was (Author: h7kanna):
I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

I can you please help understand if I have to recreate the dataset?

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  





[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna commented on HUDI-2909:
-

I am not able to determine if I fall under user type c or a/b :)

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  
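The exception above shows TimestampBasedAvroKeyGenerator rejecting a java.sql.Timestamp when the partition field is configured as a SCALAR in MICROSECONDS. Below is a minimal sketch of the kind of normalization the workaround restores: accept either a scalar (epoch microseconds) or a Timestamp, convert both to microseconds, and format the partition path in GMT. The class and method names are hypothetical, not Hudi's actual implementation, and the sketch assumes the output dateformat was yyyy/MM/dd.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

public class PartitionPathSketch {
    // Hypothetical helper: normalize the partition value to epoch microseconds,
    // then format it in the configured output timezone (GMT here).
    static String partitionPath(Object partitionVal) {
        long micros;
        if (partitionVal instanceof Timestamp) {
            // Logical timestamp came through as java.sql.Timestamp (the 0.10.0 failure case)
            micros = TimeUnit.MILLISECONDS.toMicros(((Timestamp) partitionVal).getTime());
        } else if (partitionVal instanceof Long) {
            // Already a scalar in the configured unit (MICROSECONDS)
            micros = (Long) partitionVal;
        } else {
            throw new IllegalArgumentException(
                "Unexpected type for partition field: " + partitionVal.getClass().getName());
        }
        Instant instant = Instant.ofEpochMilli(TimeUnit.MICROSECONDS.toMillis(micros));
        return DateTimeFormatter.ofPattern("yyyy/MM/dd")
            .withZone(ZoneOffset.UTC)
            .format(instant);
    }

    public static void main(String[] args) {
        // The value from the exception message: 2021-12-01 10:13:34.702 (GMT)
        Timestamp ts = Timestamp.from(Instant.parse("2021-12-01T10:13:34.702Z"));
        System.out.println(partitionPath(ts));                   // 2021/12/01
        System.out.println(partitionPath(ts.getTime() * 1000L)); // same path from micros
    }
}
```

With this kind of normalization, both representations of the same instant yield the same partition path, which is the consistency the issue title refers to.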





[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-07 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470802#comment-17470802
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Hi,

I am working on re-testing this. I am seeing other unrelated issues with partition 
discovery and schema inference when there is no wildcard in the base path.
Will post my results soon.

Thanks.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from fi

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463577#comment-17463577
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

After a few runs of the query, I see that even with a compacted metadata table, 
enabling metadata in the reader is slower.

Though it is not taking a very long time (like 1 hr) as before.

!Screen Shot 2021-12-21 at 10.22.54 PM.png!

 

!Screen Shot 2021-12-21 at 10.24.12 PM.png!

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:3

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-21 at 10.24.12 PM.png

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractH

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-21 at 10.22.54 PM.png

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractH

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463549#comment-17463549
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/22/21, 2:55 AM:


It seems that disabling the metadata in the reader is faster than enabling it:

24s - reader metadata off

59s - reader metadata on

This is a table of only 2182 files.

Either on or off, file listing is faster than before, though.


was (Author: h7kanna):
Seems like disabling the metadata in the reader is faster than enabling it.

24s - reader metadata off 

59s - reader metadata on

this is a table of only 2182 files.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, 
> metadata_timeline_archived.txt, metadata_timeline_compacted.txt, 
> stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-h

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463549#comment-17463549
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

It seems that disabling the metadata in the reader is faster than enabling it:

24s - reader metadata off

59s - reader metadata on

This is a table of only 2182 files.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, 
> metadata_timeline_archived.txt, metadata_timeline_compacted.txt, 
> stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463539#comment-17463539
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Yes, it is a non-partitioned table. I will create a new issue.

I see the file listing time varying each time I query.

In one of the runs it is only 24 seconds. I am testing again repeatedly, and 
on another, bigger table. Will post the results.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, 
> metadata_timeline_archived.txt, metadata_timeline_compacted.txt, 
> stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessi

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/22/21, 1:24 AM:


There is no separate cleaner job. I set hoodie.clean.automatic=true

It is only one writer. I have hoodie.clean.async=true because the docs say:

Only applies when hoodie.clean.automatic is turned on. When turned on, runs the 
cleaner async with writing, which can speed up overall write performance.

https://hudi.apache.org/docs/configurations#hoodiecleanasync


was (Author: h7kanna):
There is no separate cleaner job. I set hoodie.clean.automatic=true

It is only one writer. I have hoodie.clean.async=true because from the docs it 
says

Only applies when hoodie.clean.automatic is turned on. When turned on, runs the 
cleaner async with writing, which can speed up overall write performance.
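The dependency between the two cleaning options quoted above can be sketched as follows. The two config keys are real Hudi options; the dict and the helper function are purely illustrative, not part of any Hudi API:

```python
# Sketch: the cleaning-related writer options discussed above, as a plain dict
# of Hudi config keys. With a real Spark session these would be passed to a
# writer, e.g. df.write.format("hudi").options(**hudi_options) (assumed usage).
hudi_options = {
    # Automatic cleaning; must be on for async cleaning to apply.
    "hoodie.clean.automatic": "true",
    # Runs the cleaner concurrently with writing; per the docs, only honored
    # when hoodie.clean.automatic is true.
    "hoodie.clean.async": "true",
}

def effective_async_clean(opts):
    """Illustrative helper: async cleaning only takes effect when
    automatic cleaning is also enabled."""
    return (opts.get("hoodie.clean.automatic") == "true"
            and opts.get("hoodie.clean.async") == "true")

print(effective_async_clean(hudi_options))  # True
print(effective_async_clean({"hoodie.clean.async": "true"}))  # False
```

So setting hoodie.clean.async=true alone would be a no-op; it is modifying behavior here only because automatic cleaning is already on.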


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463528#comment-17463528
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

I made changes to the sync job:

1) Clustering for every 10 commits

2) Removed hoodie.metadata.clean.async=true

3) Removed hoodie.clean.async=true

I do see compaction kicking off, and file listing is much faster.

[^metadata_files_compacted.txt]

[^metadata_timeline_compacted.txt]

Though for a different table I see the exception below:

org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key from old file s3a://bucket/.hoodie/metadata/files/files-_0-45-650_20211221205951077001.hfile to new file s3a://bucket/.hoodie/metadata/files/files-_0-67-713_20211222000526106001.hfile with writerSchema {
  "type" : "record",
  "name" : "HoodieMetadataRecord",
  "namespace" : "org.apache.hudi.avro.model",
  "doc" : "A record saved within the Metadata Table",
  "fields" : [ {
    "name" : "_hoodie_commit_time",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_commit_seqno",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_record_key",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_partition_path",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_file_name",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "key",
    "type" : {
      "type" : "string",
      "avro.java.string" : "String"
    }
  }, {
    "name" : "type",
    "type" : "int",
    "doc" : "Type of the metadata record"
  }, {
    "name" : "filesystemMetadata",
    "type" : [ "null", {
      "type" : "map",
      "values" : {
        "type" : "record",
        "name" : "HoodieMetadataFileInfo",
        "fields" : [ {
          "name" : "size",
          "type" : "long",
          "doc" : "Size of the file"
        }, {
          "name" : "isDeleted",
          "type" : "boolean",
          "doc" : "True if this file has been deleted"
        } ]
      },
      "avro.java.string" : "String"
    } ],
    "doc" : "Contains information about partitions and files within the dataset"
  } ]
}
at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:102)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdateInternal(HoodieSparkCopyOnWriteTable.java:292)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdate(HoodieSparkCopyOnWriteTable.java:283)
at org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:197)
at org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$57154431$1(HoodieCompactor.java:133)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record 
into new file for key from old file 
s3a://bucket/.hoodie/metadata/files/files-_0-45-650_20211221205951077001.hfile
 to new file 
s3a://bucket/.hoodie/metadata/files/files-_0-67-713_20211222000526106001.hfile
 with writerSchema {
"type" : "record",
"name" : "HoodieMetadataRecord",
"namespace" : "org.apache.hudi.avro.model",
"doc" : "A record sav
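For readers trying to make sense of the writerSchema in the error above, it is ordinary Avro JSON and can be inspected with nothing more than the json stdlib. The copy below is trimmed by me for brevity (the five _hoodie_* meta-columns are elided); only the field names shown are taken from the schema in the log:

```python
import json

# Trimmed copy of the HoodieMetadataRecord writer schema from the error above
# (Hoodie meta-columns elided), kept just to show its shape.
schema_json = """
{
  "type": "record",
  "name": "HoodieMetadataRecord",
  "namespace": "org.apache.hudi.avro.model",
  "fields": [
    {"name": "key", "type": {"type": "string", "avro.java.string": "String"}},
    {"name": "type", "type": "int", "doc": "Type of the metadata record"},
    {"name": "filesystemMetadata", "type": ["null", {
      "type": "map",
      "values": {
        "type": "record",
        "name": "HoodieMetadataFileInfo",
        "fields": [
          {"name": "size", "type": "long"},
          {"name": "isDeleted", "type": "boolean"}
        ]
      }
    }]}
  ]
}
"""

schema = json.loads(schema_json)
# List the top-level fields of the record being merged.
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['key', 'type', 'filesystemMetadata']
```

The filesystemMetadata map is what carries the per-file size/isDeleted entries that the metadata table merge is failing on.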

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: metadata_files_compacted.txt
metadata_timeline_compacted.txt


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463453#comment-17463453
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Contents of the rollback.requested file:

{
"instantToRollback": {
"commitTime": "20211221044403331",
"action": "deltacommit"
},
"RollbackRequests": [
{
"partitionPath": "files",
"fileId": "",
"latestBaseInstant": "",
"filesToBeDeleted": [],
"logBlocksToBeDeleted": {}
},
{
"partitionPath": "files",
"fileId": "files-",
"latestBaseInstant": "20211218223919666001",
"filesToBeDeleted": [],
"logBlocksToBeDeleted": {
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.246_0-95-1905":
 5936,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.144_0-63-1952":
 7437,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.293_0-84-2355":
 6189,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.212_0-63-1778":
 6736,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.346_0-97-2576":
 5971,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.11_0-27-1737":
 58553,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.275_0-18-814":
 5910,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.99_0-21-889":
 5906,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.358_0-97-2706":
 5989,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.345_0-86-2507":
 7564,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.396_0-76-2741":
 9532,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.308_0-63-1903":
 7209,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.202_0-96-2184":
 5938,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.137_0-85-2109":
 6311,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.269_0-84-1927":
 6853,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.201_0-85-2115":
 6865,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.173_0-85-2139":
 5815,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.180_0-63-1995":
 7198,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.69_0-85-1993":
 6903,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.53_0-84-1893":
 6657,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.221_0-85-2181":
 6665,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.89_0-84-1761":
 6427,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.70_0-96-2065":
 5933,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.325_0-84-2375":
 6322,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.319_0-27-1256":
 5906,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.287_0-18-832":
 5906,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.365_0-84-3041":
 8622,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.298_0-96-1967":
 5938,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.350_0-96-3006":
 5970,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.111_0-18-732":
 5910,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.191_0-18-667":
 5938,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.73_0-84-1896":
 6345,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.60_0-63-1744":
 6525,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.323_0-18-883":
 5909,
"s3a://bucket-hudi/sessions_by_entrydate/.hoodie/metadata/files/.files-_20211218223919666001.log.389_0-84-2945":
 11320,
"s3a://bu
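Since the rollback plan above is plain JSON, its scope can be summarized directly. The plan below is a hypothetical, heavily shortened version in the same shape as the excerpt (paths abbreviated, only two log blocks kept); the real file lists hundreds of entries:

```python
import json

# Hypothetical, abbreviated rollback.requested plan mirroring the excerpt
# above; map values are the byte offsets/sizes recorded per log block.
plan_json = """
{
  "instantToRollback": {"commitTime": "20211221044403331", "action": "deltacommit"},
  "RollbackRequests": [
    {"partitionPath": "files", "fileId": "", "latestBaseInstant": "",
     "filesToBeDeleted": [], "logBlocksToBeDeleted": {}},
    {"partitionPath": "files", "fileId": "files-",
     "latestBaseInstant": "20211218223919666001",
     "filesToBeDeleted": [],
     "logBlocksToBeDeleted": {
       ".files-_20211218223919666001.log.246_0-95-1905": 5936,
       ".files-_20211218223919666001.log.144_0-63-1952": 7437
     }}
  ]
}
"""

plan = json.loads(plan_json)
# Count the log blocks touched by the rollback and total the recorded sizes.
log_blocks = sum(len(r["logBlocksToBeDeleted"]) for r in plan["RollbackRequests"])
total_bytes = sum(sz for r in plan["RollbackRequests"]
                  for sz in r["logBlocksToBeDeleted"].values())
print(log_blocks, total_bytes)  # 2 13373
```

Running the same summary over the real plan would show how many metadata-table log blocks a single rollback has to walk, which is one driver of the slow listing.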

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463448#comment-17463448
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Yes, this is for testing clustering as of now (pre-prod), so I set inline 
clustering on every commit.


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: metadata_files.txt


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463444#comment-17463444
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

metadata files

[^metadata_files.txt]

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_files.txt, metadata_timeline.txt, metadata_timeline_archived.txt, 
> stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatRead
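The repeated "Moving to the next reader" lines above cycle through metadata log files whose names carry a version counter. A minimal sketch of pulling that counter out of the quoted paths (the naming pattern is inferred from these log lines only; the class and method names are hypothetical, not Hudi's actual parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFileName {
    // Inferred from the paths above: ".<fileId>_<baseInstant>.log.<version>_<writeToken>".
    private static final Pattern LOG_VERSION = Pattern.compile("\\.log\\.(\\d+)_");

    static int logVersion(String path) {
        Matcher m = LOG_VERSION.matcher(path);
        if (!m.find()) {
            throw new IllegalArgumentException("not a log file path: " + path);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        String p = "s3a://datalake-hudi/sessions/.hoodie/metadata/files/"
                + ".files-_20211216144130775001.log.121_0-57-663";
        System.out.println(logVersion(p)); // prints 121
    }
}
```

With log versions up to 121 for a single metadata file slice, the scanner above has many small log blocks to merge per lookup, which is consistent with the slow listings reported here.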

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 7:52 PM:


There is no separate cleaner job. I set hoodie.clean.automatic=true

It is only one writer. I have 
hoodie.clean.async=true ([https://hudi.apache.org/docs/configurations#hoodiecleanasync]) 
because the docs say:

Only applies when hoodie.clean.automatic is turned on. When turned on runs 
cleaner async with writing, which can speed up overall write performance.
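The quoted doc sentence means hoodie.clean.async has no effect unless hoodie.clean.automatic is also on. A tiny sketch of that rule as a predicate over the two flags (asyncCleanEffective is a hypothetical helper, not a Hudi API):

```java
import java.util.Properties;

public class CleanerMode {
    // Hypothetical helper encoding the documented rule: hoodie.clean.async
    // only takes effect when hoodie.clean.automatic is turned on.
    static boolean asyncCleanEffective(Properties p) {
        boolean automatic = Boolean.parseBoolean(p.getProperty("hoodie.clean.automatic", "false"));
        boolean async = Boolean.parseBoolean(p.getProperty("hoodie.clean.async", "false"));
        return automatic && async;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("hoodie.clean.automatic", "true");
        p.setProperty("hoodie.clean.async", "true");
        System.out.println(asyncCleanEffective(p)); // true: cleaner runs async alongside the writer
        p.setProperty("hoodie.clean.automatic", "false");
        System.out.println(asyncCleanEffective(p)); // false: the async flag is ignored
    }
}
```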


was (Author: h7kanna):
There is no separate cleaner job.

It is only one writer. I have 
hoodie.clean.async=true ([https://hudi.apache.org/docs/configurations#hoodiecleanasync]) 
because the docs say:

Only applies when hoodie.clean.automatic is turned on. When turned on runs 
cleaner async with writing, which can speed up overall write performance.


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463442#comment-17463442
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

There is no separate cleaner job.

It is only one writer. I have 
hoodie.clean.async=true ([https://hudi.apache.org/docs/configurations#hoodiecleanasync]) 
because the docs say:

Only applies when hoodie.clean.automatic is turned on. When turned on runs 
cleaner async with writing, which can speed up overall write performance.


[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463385#comment-17463385
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:58 PM:


Hi [~manojg] 

It is a single writer, with inline clustering and an automatic cleaner.

[^writer_log.txt]

hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
hoodie.datasource.hive_sync.table=sessions_by_entrydate
hoodie.metadata.clean.async=true
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.hive_sync.partition_fields=entrydate
hoodie.finalize.write.parallelism=256
hoodie.clean.automatic=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
hoodie.deltastreamer.source.dfs.root=s3://bucket-archive/sessions_prd_v1/data/2021/12/21/17
hoodie.datasource.hive_sync.mode=hms
hoodie.clean.async=true
hoodie.metadata.metrics.enable=true
hoodie.parquet.max.file.size=268435456
hoodie.datasource.write.recordkey.field=id
hoodie.metadata.enable=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
hoodie.clustering.preserve.commit.metadata=true
hoodie.parquet.small.file.limit=25000
hoodie.datasource.hive_sync.enable=false
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake.prd.upsert.sessions_by_entrydate
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.commits.retained=10
hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
hoodie.clustering.inline=true
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
hoodie.clustering.plan.strategy.max.num.groups=1000
hoodie.parquet.compression.codec=snappy
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.file.listing.parallelism=256
hoodie.datasource.write.partitionpath.field=entrydate:TIMESTAMP
hoodie.parquet.block.size=268435456
hoodie.upsert.shuffle.parallelism=200
hoodie.datasource.hive_sync.ignore_exceptions=true
hoodie.clustering.plan.strategy.small.file.limit=25000
hoodie.datasource.write.precombine.field=updatedate
hoodie.clustering.plan.strategy.sort.columns=surveyid,groupid
hoodie.metrics.reporter.type=CLOUDWATCH
hoodie.metrics.on=true
hoodie.datasource.hive_sync.database=datalake-hudi
hoodie.datasource.write.operation=upsert
hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
hoodie.deltastreamer.transformer.sql=SELECT * FROM  a
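The settings above are plain key=value pairs, so they round-trip through java.util.Properties; a minimal sketch of loading such a dump (abbreviated to a few of the keys listed, purely to illustrate the format):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class WriterConfig {
    public static void main(String[] args) throws IOException {
        // A few keys from the writer config above, in the same key=value format.
        String dump = String.join("\n",
                "hoodie.metadata.enable=true",
                "hoodie.clustering.inline=true",
                "hoodie.clustering.inline.max.commits=1",
                "hoodie.cleaner.commits.retained=10");
        Properties props = new Properties();
        props.load(new StringReader(dump));
        System.out.println(props.getProperty("hoodie.metadata.enable"));          // true
        System.out.println(props.getProperty("hoodie.cleaner.commits.retained")); // 10
    }
}
```

Note that hoodie.clustering.inline=true with hoodie.clustering.inline.max.commits=1 schedules clustering after every commit, which matches the "many replace commits" noted in the issue description.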
 


was (Author: h7kanna):
Hi [~manojg] 

It is a single writer, with inline clustering and an automatic cleaner.

[^writer_log.txt]



[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463385#comment-17463385
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Hi [~manojg] 

It is a single writer, with inline clustering and an automatic cleaner.

[^writer_log.txt]


hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
hoodie.datasource.hive_sync.table=sessions_by_entrydate
hoodie.metadata.clean.async=true
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.hive_sync.partition_fields=entrydate
hoodie.finalize.write.parallelism=256
hoodie.clean.automatic=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
hoodie.deltastreamer.source.dfs.root=s3://bucket-archive/sessions_prd_v1/data/2021/12/21/17
hoodie.datasource.hive_sync.mode=hms
hoodie.clean.async=true
hoodie.metadata.metrics.enable=true
hoodie.parquet.max.file.size=268435456
hoodie.datasource.write.recordkey.field=id
hoodie.metadata.enable=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
hoodie.clustering.preserve.commit.metadata=true
hoodie.parquet.small.file.limit=25000
hoodie.datasource.hive_sync.enable=false
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.metrics.cloudwatch.metric.prefix=emr.datalake.prd.upsert.sessions_by_entrydate
hoodie.clustering.inline.max.commits=1
hoodie.cleaner.commits.retained=10
hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
hoodie.clustering.inline=true
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
hoodie.clustering.plan.strategy.max.num.groups=1000
hoodie.parquet.compression.codec=snappy
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.file.listing.parallelism=256
hoodie.datasource.write.partitionpath.field=entrydate:TIMESTAMP
hoodie.parquet.block.size=268435456
hoodie.upsert.shuffle.parallelism=200
hoodie.datasource.hive_sync.ignore_exceptions=true
hoodie.clustering.plan.strategy.small.file.limit=25000
hoodie.datasource.write.precombine.field=updatedate
hoodie.clustering.plan.strategy.sort.columns=surveyid,groupid
hoodie.metrics.reporter.type=CLOUDWATCH
hoodie.metrics.on=true
hoodie.datasource.hive_sync.database=lucid-datalake-hudi
hoodie.datasource.write.operation=upsert
hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
hoodie.deltastreamer.transformer.sql=SELECT * FROM  a
 


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: writer_log.txt


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: writer_log.txt)

> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3
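A note on the reader snippet quoted above: the metadata-listing flag can also be passed directly as a per-read option rather than through the session conf, which scopes it to the one DataFrame read. A minimal sketch, assuming the same illustrative bucket and path layout as the report (not verified against the reporter's environment):

{code:scala}
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

// Illustrative base path; bucket and partition layout mirror the report.
val basePath = "s3a://datalake-hudi"

val sessions = spark.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE.key(),
    DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  // Enable metadata-table-based file listing for this read only.
  .option(HoodieMetadataConfig.ENABLE.key(), "true")
  .option(DataSourceReadOptions.READ_PATHS.key(),
    s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
  .load()

sessions.createOrReplaceTempView("sessions")
{code}

Either form should exercise the same HoodieBackedTableMetadata code path on the reader; the per-read option simply makes the setting explicit at the call site.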

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-21 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: writer_log.txt


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

{*}Note{*}: I ran the recent query from 'master' as I needed a fix of running 
clustering in parallel from master.


[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:29 AM:


{*}Note{*}: I ran the recent query using 'master' as I needed a fix of running 
clustering in parallel from master.


was (Author: h7kanna):
{*}Note{*}: I ran the recent query from 'master' as I needed a fix of running 
clustering in parallel from master.


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: Screen Shot 2021-12-20 at 10.17.44 PM-1.png)


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: stderr_part2-1.txt)


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: timeline-1.txt)


[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:37 AM:


Complete log files for Slow run (Metadata reader on)

[^stderr_part1.txt]

[^stderr_part2.txt]


was (Author: h7kanna):
Complete log files

[^stderr_part1.txt]

[^stderr_part2.txt]

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2-1.txt, 
> stderr_part2.txt, timeline-1.txt, timeline.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), each
> file-listing task takes even more time:
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
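> For comparison, a minimal sketch (assuming the same Hudi 0.10.0 read path; `HoodieMetadataConfig.ENABLE` is the same key used above, here passed as a per-read option rather than via {{spark.conf}}) that runs the identical query with metadata-based listing turned off, so listing falls back to direct file-system calls:
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
>
> // Hypothetical comparison run: same query, metadata table disabled on the reader.
> val sessionsNoMdt = spark
>   .read
>   .format("org.apache.hudi")
>   .option(HoodieMetadataConfig.ENABLE.key(), "false")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(),
>     DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(),
>     s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessionsNoMdt.createOrReplaceTempView("sessions_no_mdt")
> {code}
> Timing the same query against both views isolates how much of the per-task latency comes from the metadata-table lookup itself.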
> Existing tables (COW) have inline clustering enabled and many replace commits.
> Logs suggest the delay is in the view.AbstractTableFileSystemView
> resetFileGroupsReplaced method or in metadata.HoodieBackedTableMetadata.
> AbstractHoodieLogRecordReader also emits many log messages:
> 
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read 136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.lo

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:36 AM:


Complete log files

[^stderr_part1.txt]

[^stderr_part2.txt]


was (Author: h7kanna):
Complete log files

 

[^stderr_part1.txt]

[^stderr_part2.txt]


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: stderr_part2-1.txt


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: stderr_part1.txt
stderr_part2.txt


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Complete log files

 

[^stderr_part1.txt]

[^stderr_part2.txt]


[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462981#comment-17462981
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Metadata on reader side disabled

 

!Screen Shot 2021-12-20 at 10.17.44 PM.png!


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM-1.png


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM.png

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_timeline.txt, metadata_timeline_archived.txt, timeline-1.txt, 
> timeline.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side, it takes even more time per 
> file-listing task.

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462979#comment-17462979
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Metadata timeline

[^metadata_timeline.txt]

 

[^metadata_timeline_archived.txt]

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: metadata_timeline_archived.txt

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: metadata_timeline.txt

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, metadata_timeline.txt, timeline-1.txt, 
> timeline.txt
>
>
