[jira] [Commented] (HUDI-1307) Spark datasource load path format is confusing for snapshot and incremental read modes

2021-09-26 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420358#comment-17420358
 ] 

Raymond Xu commented on HUDI-1307:
--

[~309637554] Any update on this improvement? Definitely useful to align on the 
pattern. In fact, today, to enable HoodieFileIndex, users should avoid passing in a 
glob path. Is it still necessary to keep the glob path pattern around?
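For illustration, a minimal sketch of the two load patterns in question (paths and the partition predicate are made up; behavior as described in the issue below):
{code:java}
// Glob-style load: needs the trailing "/*" and bypasses HoodieFileIndex
val globDf = spark.read.format("hudi").load(basePath + "/*")

// Plain base-path load: lets HoodieFileIndex list files and prune partitions
val df = spark.read.format("hudi").load(basePath)
df.where("year = '2019'").show()
{code}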

> Spark datasource load path format is confusing for snapshot and incremental read 
> modes
> --
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Critical
>  Labels: sev:high, user-support-issues
>
> When reading a Hudi table through the Spark datasource:
> 1. Snapshot mode
> {code:java}
>  val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*");
> {code}
> The "/*" suffix must be added, otherwise the read fails: 
> org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and without 
> "/*" it does not pick up the .hoodie and default directories.
> {code:java}
> val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs){code}
>  
> 2. Incremental mode
> Both basePath and basePath + "/*" work. This is because in 
> org.apache.hudi.DefaultSource, DataSourceUtils.getTablePath supports both formats.
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath){code}
>  
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath + "/*")
>  {code}
>  
> Since incremental mode and snapshot mode do not behave the same way, users get 
> confused. Having load take basePath + "/*" or "/***/*" is also confusing; I know 
> this is to support partitioned tables, but I think this API would be clearer for 
> users:
>  
> {code:java}
> val partition = "year = '2019'"
> spark.read.format("hudi").load(path).where(partition)
> {code}
>  
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2440) Add dependency change diff script for dependency governance

2021-09-26 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2440:
-
Component/s: Usability

> Add dependency change diff script for dependency governance
> --
>
> Key: HUDI-2440
> URL: https://issues.apache.org/jira/browse/HUDI-2440
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability, Utilities
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Hudi's dependency management is chaotic. For example, for 
> `hudi-spark-bundle_2.11`, the dependency list is:
> {code:java}
> HikariCP/2.5.1//HikariCP-2.5.1.jar
> ST4/4.0.4//ST4-4.0.4.jar
> aircompressor/0.15//aircompressor-0.15.jar
> annotations/17.0.0//annotations-17.0.0.jar
> ant-launcher/1.9.1//ant-launcher-1.9.1.jar
> ant/1.6.5//ant-1.6.5.jar
> ant/1.9.1//ant-1.9.1.jar
> antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar
> aopalliance/1.0//aopalliance-1.0.jar
> apache-curator/2.7.1//apache-curator-2.7.1.pom
> apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar
> apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar
> api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar
> api-util/1.0.0-M20//api-util-1.0.0-M20.jar
> asm/3.1//asm-3.1.jar
> avatica-metrics/1.8.0//avatica-metrics-1.8.0.jar
> avatica/1.8.0//avatica-1.8.0.jar
> avro/1.8.2//avro-1.8.2.jar
> bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
> calcite-core/1.10.0//calcite-core-1.10.0.jar
> calcite-druid/1.10.0//calcite-druid-1.10.0.jar
> calcite-linq4j/1.10.0//calcite-linq4j-1.10.0.jar
> commons-beanutils-core/1.8.0//commons-beanutils-core-1.8.0.jar
> commons-beanutils/1.7.0//commons-beanutils-1.7.0.jar
> commons-cli/1.2//commons-cli-1.2.jar
> commons-codec/1.4//commons-codec-1.4.jar
> commons-collections/3.2.2//commons-collections-3.2.2.jar
> commons-compiler/2.7.6//commons-compiler-2.7.6.jar
> commons-compress/1.9//commons-compress-1.9.jar
> commons-configuration/1.6//commons-configuration-1.6.jar
> commons-daemon/1.0.13//commons-daemon-1.0.13.jar
> commons-dbcp/1.4//commons-dbcp-1.4.jar
> commons-digester/1.8//commons-digester-1.8.jar
> commons-el/1.0//commons-el-1.0.jar
> commons-httpclient/3.1//commons-httpclient-3.1.jar
> commons-io/2.4//commons-io-2.4.jar
> commons-lang/2.6//commons-lang-2.6.jar
> commons-lang3/3.1//commons-lang3-3.1.jar
> commons-logging/1.2//commons-logging-1.2.jar
> commons-math/2.2//commons-math-2.2.jar
> commons-math3/3.1.1//commons-math3-3.1.1.jar
> commons-net/3.1//commons-net-3.1.jar
> commons-pool/1.5.4//commons-pool-1.5.4.jar
> curator-client/2.7.1//curator-client-2.7.1.jar
> curator-framework/2.7.1//curator-framework-2.7.1.jar
> curator-recipes/2.7.1//curator-recipes-2.7.1.jar
> datanucleus-api-jdo/4.2.4//datanucleus-api-jdo-4.2.4.jar
> datanucleus-core/4.1.17//datanucleus-core-4.1.17.jar
> datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar
> derby/10.10.2.0//derby-10.10.2.0.jar
> disruptor/3.3.0//disruptor-3.3.0.jar
> dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
> eigenbase-properties/1.1.5//eigenbase-properties-1.1.5.jar
> fastutil/7.0.13//fastutil-7.0.13.jar
> findbugs-annotations/1.3.9-1//findbugs-annotations-1.3.9-1.jar
> fluent-hc/4.4.1//fluent-hc-4.4.1.jar
> groovy-all/2.4.4//groovy-all-2.4.4.jar
> gson/2.3.1//gson-2.3.1.jar
> guava/14.0.1//guava-14.0.1.jar
> guice-assistedinject/3.0//guice-assistedinject-3.0.jar
> guice-servlet/3.0//guice-servlet-3.0.jar
> guice/3.0//guice-3.0.jar
> hadoop-annotations/2.7.3//hadoop-annotations-2.7.3.jar
> hadoop-auth/2.7.3//hadoop-auth-2.7.3.jar
> hadoop-client/2.7.3//hadoop-client-2.7.3.jar
> hadoop-common/2.7.3//hadoop-common-2.7.3.jar
> hadoop-common/2.7.3/tests/hadoop-common-2.7.3-tests.jar
> hadoop-hdfs/2.7.3//hadoop-hdfs-2.7.3.jar
> hadoop-hdfs/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar
> hadoop-mapreduce-client-app/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar
> hadoop-mapreduce-client-common/2.7.3//hadoop-mapreduce-client-common-2.7.3.jar
> hadoop-mapreduce-client-core/2.7.3//hadoop-mapreduce-client-core-2.7.3.jar
> hadoop-mapreduce-client-jobclient/2.7.3//hadoop-mapreduce-client-jobclient-2.7.3.jar
> hadoop-mapreduce-client-shuffle/2.7.3//hadoop-mapreduce-client-shuffle-2.7.3.jar
> hadoop-yarn-api/2.7.3//hadoop-yarn-api-2.7.3.jar
> hadoop-yarn-client/2.7.3//hadoop-yarn-client-2.7.3.jar
> hadoop-yarn-common/2.7.3//hadoop-yarn-common-2.7.3.jar
> hadoop-yarn-registry/2.7.1//hadoop-yarn-registry-2.7.1.jar
> hadoop-yarn-server-applicationhistoryservice/2.7.2//hadoop-yarn-server-applicationhistoryservice-2.7.2.jar
> hadoop-yarn-server-common/2.7.2//hadoop-yarn-server-common-2.7.2.jar
> hadoop-yarn-server-resourcemanager/2.7.2//hadoop-yarn-server-resourcemanager-2.7.2.jar
> 

[jira] [Commented] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-28 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421584#comment-17421584
 ] 

Raymond Xu commented on HUDI-2496:
--

[~helias_an] Sure, assigned! Please ping us even with a draft PR; we can give 
early feedback.

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Assignee: Helias Antoniou
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original GH issue https://github.com/apache/hudi/issues/3709
> Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files]
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap keyed by record key to store the 
> incoming records. As a result, duplicates in the 1st batch remain intact, but for 
> the 2nd batch only unique records are considered and later concatenated with the 
> 1st batch.
>  
> [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
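> To illustrate the RCA (a simplified sketch, not the actual HoodieMergeHandle code): 
> keying incoming records by record key collapses duplicates within the incoming batch 
> even when dedup is disabled.
> {code:java}
> // Hypothetical stand-in for the keyed-records map inside the merge handle
> val incoming = Seq(("key1", "v1"), ("key1", "v2"), ("key2", "v3"))
> val keyedRecords = scala.collection.mutable.Map[String, String]()
> incoming.foreach { case (recordKey, payload) => keyedRecords.put(recordKey, payload) }
> // keyedRecords.size == 2: the duplicate of key1 is silently merged before concatenation
> {code}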



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2496:


Assignee: Helias Antoniou

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Assignee: Helias Antoniou
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original GH issue https://github.com/apache/hudi/issues/3709
> Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files]
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap keyed by record key to store the 
> incoming records. As a result, duplicates in the 1st batch remain intact, but for 
> the 2nd batch only unique records are considered and later concatenated with the 
> 1st batch.
>  
> [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1998) Provide a way to find list of commits through a pythonic API

2021-09-29 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1998:
-
Description: 
TimelineUtils is a Java API with which one can get the latest commit or 
instantiate HoodieActiveTimeline. Users are looking to do the same through 
a Python API.

 

[https://github.com/apache/hudi/issues/2987]

 

Related issue

https://github.com/apache/hudi/issues/3641

  was:
TimelineUtils is a Java API with which one can get the latest commit or 
instantiate HoodieActiveTimeline. Users are looking to do the same through 
a Python API.

 

https://github.com/apache/hudi/issues/2987


> Provide a way to find list of commits through a pythonic API 
> -
>
> Key: HUDI-1998
> URL: https://issues.apache.org/jira/browse/HUDI-1998
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Priority: Minor
>
> TimelineUtils is a Java API with which one can get the latest commit or 
> instantiate HoodieActiveTimeline. Users are looking to do the same 
> through a Python API.
>  
> [https://github.com/apache/hudi/issues/2987]
>  
> Related issue
> https://github.com/apache/hudi/issues/3641
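> For context, a minimal Scala sketch of what this looks like on the JVM today (class 
> and method names assumed from hudi-common around 0.9; the ask is to expose the 
> equivalent through Python):
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hudi.common.table.HoodieTableMetaClient
>
> val metaClient = HoodieTableMetaClient.builder()
>   .setConf(new Configuration())
>   .setBasePath("/path/to/hudi_table")
>   .build()
> // List completed commit instants from the active timeline
> metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants
>   .getInstants.forEach(i => println(i.getTimestamp))
> {code}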



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2500) Spark datasource delete not working on Spark SQL created table

2021-09-29 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2500:


 Summary: Spark datasource delete not working on Spark SQL created 
table
 Key: HUDI-2500
 URL: https://issues.apache.org/jira/browse/HUDI-2500
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Reporter: Raymond Xu
 Fix For: 0.10.0


Original issue https://github.com/apache/hudi/issues/3670



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen

2021-09-30 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2495:
-
Parent: HUDI-2505
Issue Type: Sub-task  (was: Bug)

> Difference in behavior between GenericRecord based key gen and Row based key 
> gen 
> -
>
> Key: HUDI-2495
> URL: https://issues.apache.org/jira/browse/HUDI-2495
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: sev:critical
>
> When the complex key generator is used and one of the fields in the record key is 
> a timestamp field, the row-writer path and the RDD path produce different record 
> key values. The GenericRecord path converts the timestamp, whereas the row-writer 
> path does not do any conversion.
>  
> {code:java}
> import java.sql.Timestamp
> import spark.implicits._
>
> val df = Seq(
>   (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
>   (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
>   (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
>   (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
> ).toDF("typeId", "eventTime", "str")
>
> df.write.format("hudi").
>   option("hoodie.insert.shuffle.parallelism", "2").
>   option("hoodie.upsert.shuffle.parallelism", "2").
>   option("hoodie.bulkinsert.shuffle.parallelism", "2").
>   option("hoodie.datasource.write.precombine.field", "typeId").
>   option("hoodie.datasource.write.partitionpath.field", "typeId").
>   option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
>   option("hoodie.table.name", "hudi_tbl").
>   mode(Overwrite).
>   save("/tmp/hudi_tbl_trial/")
>
> val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/")
> hudiDF.createOrReplaceTempView("hudi_sql_tbl")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from hudi_sql_tbl").show(false)
> {code}
>  
> {code:java}
> +--+---+---+--+
> |_hoodie_record_key                |str|eventTime          |typeId|
> +--+---+---+--+
> |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1     |
> |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1     |
> |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2     |
> +--+---+---+--+
> {code}
>  
>  
> // now retry w/ bulk_insert row writer path
> {code:java}
> df.write.format("hudi").
>   option("hoodie.insert.shuffle.parallelism", "2").
>   option("hoodie.upsert.shuffle.parallelism", "2").
>   option("hoodie.bulkinsert.shuffle.parallelism", "2").
>   option("hoodie.datasource.write.precombine.field", "typeId").
>   option("hoodie.datasource.write.partitionpath.field", "typeId").
>   option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
>   option("hoodie.table.name", "hudi_tbl").
>   option("hoodie.datasource.write.operation", "bulk_insert").
>   mode(Overwrite).
>   save("/tmp/hudi_tbl_trial_bulk_insert/")
>
> val hudiDF_bulk_insert = spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/")
> hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from hudi_sql_tbl_bulk_insert").show(false)
> {code}
> {code:java}
> +---+---+---+--+
> |_hoodie_record_key                     |str|eventTime          |typeId|
> +---+---+---+--+
> |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2     |
> |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1     |
> |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1     |
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL

2021-09-30 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2390:
-
Parent: HUDI-2505
Issue Type: Sub-task  (was: Improvement)

> KeyGenerator discrepancy between DataFrame writer and SQL
> -
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: renhao
>Priority: Critical
>  Labels: sev:critical
>
> Test Case:
> {code:java}
>  import org.apache.hudi.QuickstartUtils._
>  import scala.collection.JavaConversions._
>  import org.apache.spark.sql.SaveMode._
>  import org.apache.hudi.DataSourceReadOptions._
>  import org.apache.hudi.DataSourceWriteOptions._
>  import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>  
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>  
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')"){code}
> 3. Write data to test2 via the datasource
>  
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
> option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).      
> option(RECORDKEY_FIELD_OPT_KEY, "a").      
> option(PARTITIONPATH_FIELD_OPT_KEY, "b").      
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator"). 
> option(OPERATION_OPT_KEY, "bulk_insert").      
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").      
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").   
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
>       
> option(HIVE_DATABASE_OPT_KEY, "testdb").      
> option(HIVE_TABLE_OPT_KEY, "test2").      
> option(HIVE_USE_JDBC_OPT_KEY, "true").      
> option("hoodie.bulkinsert.shuffle.parallelism", 4).
> option("hoodie.datasource.write.hive_style_partitioning", "true").      
> option(TABLE_NAME, 
> "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
>  
> The query result at this point is as follows:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete one record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query; the record with a=1 has not been deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2505) [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies

2021-09-30 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2505:
-
Labels: sev:critical  (was: )

> [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies
> 
>
> Key: HUDI-2505
> URL: https://issues.apache.org/jira/browse/HUDI-2505
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: sev:critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2500) Spark datasource delete not working on Spark SQL created table

2021-09-30 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2500:
-
Parent: HUDI-2505
Issue Type: Sub-task  (was: Bug)

> Spark datasource delete not working on Spark SQL created table
> --
>
> Key: HUDI-2500
> URL: https://issues.apache.org/jira/browse/HUDI-2500
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original issue [https://github.com/apache/hudi/issues/3670]
>  
> Script to re-produce
> {code:java}
> val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table"
> val sparkSourceTableName = "test_spark_table"
> val hudiTablePath = s"${tmp.getCanonicalPath}/test_hudi_table"
> val hudiTableName = "test_hudi_table"
> println("0 - prepare source data")
> spark.createDataFrame(Seq(
>   ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
>   ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
>   ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
>   ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
>   ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
>   ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
> )).toDF("id", "creation_date", "last_update_time")
>   .withColumn("creation_date", expr("cast(creation_date as date)"))
>   .withColumn("id", expr("cast(id as bigint)"))
>   .write
>   .option("path", sparkSourceTablePath)
>   .mode("overwrite")
>   .format("parquet")
>   .saveAsTable(sparkSourceTableName)
> println("1 - CTAS to load data to Hudi")
> val hudiOptions = Map[String, String](
>   HoodieWriteConfig.TBL_NAME.key() -> hudiTableName,
>   DataSourceWriteOptions.TABLE_NAME.key() -> hudiTableName,
>   DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE",
>   DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id",
>   DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> 
> classOf[ComplexKeyGenerator].getCanonicalName,
>   DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key() -> 
> classOf[DefaultHoodieRecordPayload].getCanonicalName,
>   DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "creation_date",
>   DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "last_update_time",
>   HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1",
>   HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1",
>   HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1",
>   HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1",
>   HoodieWriteConfig.DELETE_PARALLELISM_VALUE.key() -> "1"
> )
> spark.sql(
>   s"""create table if not exists $hudiTableName using hudi
>  |  location '$hudiTablePath'
>  |  options (
>  |  type = 'cow',
>  |  primaryKey = 'id',
>  |  preCombineField = 'last_update_time'
>  |  )
>  |  partitioned by (creation_date)
>  |  AS
>  |  select id, last_update_time, creation_date from 
> $sparkSourceTableName
>  |  """.stripMargin)
> println("2 - Hudi table has all records")
> spark.sql(s"select * from $hudiTableName").show(100)
> println("3 - pick 105 to delete")
> val rec105 = spark.sql(s"select * from $hudiTableName where id = 105")
> rec105.show()
> println("4 - issue delete (Spark SQL)")
> spark.sql(s"delete from $hudiTableName where id = 105")
> println("5 - 105 is deleted")
> spark.sql(s"select * from $hudiTableName").show(100)
> println("6 - pick 104 to delete")
> val rec104 = spark.sql(s"select * from $hudiTableName where id = 104")
> rec104.show()
> println("7 - issue delete (DataSource)")
> rec104.write
>   .format("hudi")
>   .options(hudiOptions)
>   .option(DataSourceWriteOptions.OPERATION.key(), "delete")
>   .option(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key(), 
> classOf[EmptyHoodieRecordPayload].getCanonicalName)
>   .mode(SaveMode.Append)
>   .save(hudiTablePath)
> println("8 - 104 should be deleted")
> spark.sql(s"select * from $hudiTableName").show(100)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2505) [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies

2021-09-30 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2505:


 Summary: [UMBRELLA] Spark DataSource APIs and Spark SQL 
discrepancies
 Key: HUDI-2505
 URL: https://issues.apache.org/jira/browse/HUDI-2505
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2500) Spark datasource delete not working on Spark SQL created table

2021-09-30 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2500:
-
Description: 
Original issue [https://github.com/apache/hudi/issues/3670]

 

Script to re-produce
{code:java}
val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table"
val sparkSourceTableName = "test_spark_table"
val hudiTablePath = s"${tmp.getCanonicalPath}/test_hudi_table"
val hudiTableName = "test_hudi_table"
println("0 - prepare source data")
spark.createDataFrame(Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
  ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
  ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
  ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
)).toDF("id", "creation_date", "last_update_time")
  .withColumn("creation_date", expr("cast(creation_date as date)"))
  .withColumn("id", expr("cast(id as bigint)"))
  .write
  .option("path", sparkSourceTablePath)
  .mode("overwrite")
  .format("parquet")
  .saveAsTable(sparkSourceTableName)

println("1 - CTAS to load data to Hudi")
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TBL_NAME.key() -> hudiTableName,
  DataSourceWriteOptions.TABLE_NAME.key() -> hudiTableName,
  DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id",
  DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> 
classOf[ComplexKeyGenerator].getCanonicalName,
  DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key() -> 
classOf[DefaultHoodieRecordPayload].getCanonicalName,
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "last_update_time",
  HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1",
  HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1",
  HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1",
  HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1",
  HoodieWriteConfig.DELETE_PARALLELISM_VALUE.key() -> "1"
)
spark.sql(
  s"""create table if not exists $hudiTableName using hudi
 |  location '$hudiTablePath'
 |  options (
 |  type = 'cow',
 |  primaryKey = 'id',
 |  preCombineField = 'last_update_time'
 |  )
 |  partitioned by (creation_date)
 |  AS
 |  select id, last_update_time, creation_date from 
$sparkSourceTableName
 |  """.stripMargin)
println("2 - Hudi table has all records")
spark.sql(s"select * from $hudiTableName").show(100)
println("3 - pick 105 to delete")
val rec105 = spark.sql(s"select * from $hudiTableName where id = 105")
rec105.show()
println("4 - issue delete (Spark SQL)")
spark.sql(s"delete from $hudiTableName where id = 105")
println("5 - 105 is deleted")
spark.sql(s"select * from $hudiTableName").show(100)
println("6 - pick 104 to delete")
val rec104 = spark.sql(s"select * from $hudiTableName where id = 104")
rec104.show()
println("7 - issue delete (DataSource)")
rec104.write
  .format("hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.OPERATION.key(), "delete")
  .option(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key(), 
classOf[EmptyHoodieRecordPayload].getCanonicalName)
  .mode(SaveMode.Append)
  .save(hudiTablePath)
println("8 - 104 should be deleted")
spark.sql(s"select * from $hudiTableName").show(100)
{code}

  was:Original issue https://github.com/apache/hudi/issues/3670


> Spark datasource delete not working on Spark SQL created table
> --
>
> Key: HUDI-2500
> URL: https://issues.apache.org/jira/browse/HUDI-2500
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original issue [https://github.com/apache/hudi/issues/3670]
>  
> Script to re-produce
> {code:java}
> val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table"
> val sparkSourceTableName = "test_spark_table"
> val hudiTablePath = s"${tmp.getCanonicalPath}/test_hudi_table"
> val hudiTableName = "test_hudi_table"
> println("0 - prepare source data")
> spark.createDataFrame(Seq(
>   ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
>   ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
>   ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
>   ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
>   ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
>   ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
> )).toDF("id", "creation_date", "last_update_time")
>   .withColumn("creation_date", expr("cast(creation_date as date)"))
>   

[jira] [Created] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-07 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2531:


 Summary: [UMBRELLA] Support Dataset APIs in writer paths
 Key: HUDI-2531
 URL: https://issues.apache.org/jira/browse/HUDI-2531
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Labels: hudi-umbrellas sev:critical user-support-issues  (was: )

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, sev:critical, user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Description: To make use of Dataset APIs in writer paths instead of RDD.

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, sev:critical, user-support-issues
>
> To make use of Dataset APIs in writer paths instead of RDD.
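> For reference, the bulk_insert path already offers a row/Dataset writer that can be 
> toggled; a hedged sketch of using it (config key assumed from DataSourceWriteOptions; 
> not the final design of this umbrella):
> {code:java}
> // Row-writer path operates on Dataset[Row] directly, avoiding the RDD conversion
> df.write.format("hudi").
>   option("hoodie.datasource.write.operation", "bulk_insert").
>   option("hoodie.datasource.write.row.writer.enable", "true").
>   option("hoodie.table.name", "hudi_tbl").
>   mode("append").
>   save(basePath)
> {code}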



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2452) spark on hudi metadata key length < 0

2021-09-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2452:
-
Labels: sev:critical  (was: pull-request-available sev:critical)

> spark on hudi metadata key length < 0
> -
>
> Key: HUDI-2452
> URL: https://issues.apache.org/jira/browse/HUDI-2452
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: xy
>Priority: Blocker
>  Labels: sev:critical
> Fix For: 0.10.0
>
> Attachments: metadata表.txt
>
>
> Spark on Hudi metadata key length <= 0, but the data's primary key is not "" or 
> null; error messages are in the attachment.
>  
>  
> https://github.com/apache/hudi/issues/3688



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2482) Support drop partitions SQL

2021-09-22 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2482:


 Summary: Support drop partitions SQL
 Key: HUDI-2482
 URL: https://issues.apache.org/jira/browse/HUDI-2482
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: Yann Byron
Assignee: Yann Byron
 Fix For: 0.10.0


Spark SQL supports the following syntax to show a Hudi table's partitions.
{code:java}
SHOW PARTITIONS tableIdentifier partitionSpec?{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2482) Support drop partitions SQL

2021-09-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2482:
-
Parent: HUDI-1658
Issue Type: Sub-task  (was: Improvement)

> Support drop partitions SQL
> ---
>
> Key: HUDI-2482
> URL: https://issues.apache.org/jira/browse/HUDI-2482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2456) Support show partitions SQL

2021-09-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2456:
-
Parent: HUDI-1658
Issue Type: Sub-task  (was: Improvement)

> Support show partitions SQL
> ---
>
> Key: HUDI-2456
> URL: https://issues.apache.org/jira/browse/HUDI-2456
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>
> Spark SQL supports the following syntax to show a Hudi table's partitions.
> {code:java}
> SHOW PARTITIONS tableIdentifier partitionSpec?{code}
>  
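> For example, expected usage against a partitioned Hudi table would look like this 
> (table name and partition spec are illustrative):
> {code:java}
> spark.sql("SHOW PARTITIONS hudi_tbl").show(false)
> spark.sql("SHOW PARTITIONS hudi_tbl PARTITION (dt='2021-09-22')").show(false)
> {code}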



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2482) Support drop partitions SQL

2021-09-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2482:
-
Description: (was: Spark SQL supports the following syntax to show a Hudi 
table's partitions.
{code:java}
SHOW PARTITIONS tableIdentifier partitionSpec?{code}
 )

> Support drop partitions SQL
> ---
>
> Key: HUDI-2482
> URL: https://issues.apache.org/jira/browse/HUDI-2482
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2482) Support drop partitions SQL

2021-09-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2482:


Assignee: (was: Yann Byron)

> Support drop partitions SQL
> ---
>
> Key: HUDI-2482
> URL: https://issues.apache.org/jira/browse/HUDI-2482
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210

2021-10-02 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2108:


Assignee: Raymond Xu  (was: Vinoth Chandar)

> Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
> -
>
> Key: HUDI-2108
> URL: https://issues.apache.org/jira/browse/HUDI-2108
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210

2021-10-02 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2108:
-
Status: In Progress  (was: Open)

> Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
> -
>
> Key: HUDI-2108
> URL: https://issues.apache.org/jira/browse/HUDI-2108
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2516) Upgrade to Junit 5.8.1

2021-10-04 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2516:


 Summary: Upgrade to Junit 5.8.1
 Key: HUDI-2516
 URL: https://issues.apache.org/jira/browse/HUDI-2516
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Testing
Reporter: Raymond Xu
Assignee: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2108:
-
Description: 
org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore

 
 flakiness came from
{code:java}
client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // sometimes 
never create 007.compaction.requested
client.compact(newCommitTime); // then this would fail{code}

  was:
org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore

 
flakiness came from
client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // 
sometimes never create 007.compaction.requested
client.compact(newCommitTime); // then this would fail


> Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
> -
>
> Key: HUDI-2108
> URL: https://issues.apache.org/jira/browse/HUDI-2108
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore
>  
>  flakiness came from
> {code:java}
> client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // 
> sometimes never create 007.compaction.requested
> client.compact(newCommitTime); // then this would fail{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2108:
-
Component/s: Testing

> Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
> -
>
> Key: HUDI-2108
> URL: https://issues.apache.org/jira/browse/HUDI-2108
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore
>  
> flakiness came from
> client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // 
> sometimes never create 007.compaction.requested
> client.compact(newCommitTime); // then this would fail



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2108:
-
Description: 
org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore

 
flakiness came from
client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // 
sometimes never create 007.compaction.requested
client.compact(newCommitTime); // then this would fail

  
was:https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632


> Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
> -
>
> Key: HUDI-2108
> URL: https://issues.apache.org/jira/browse/HUDI-2108
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore
>  
> flakiness came from
> client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // 
> sometimes never create 007.compaction.requested
> client.compact(newCommitTime); // then this would fail



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2528) Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2528:
-
Parent: HUDI-1248
Issue Type: Sub-task  (was: Bug)

> Flaky test: [ERROR] HoodieTableType).[2] 
> MERGE_ON_READ(testTableOperationsWithRestore
> -
>
> Key: HUDI-2528
> URL: https://issues.apache.org/jira/browse/HUDI-2528
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
>  
> {code:java}
>  [ERROR] Failures:[ERROR] There files should have been rolled-back when 
> rolling back commit 002 but are still remaining. Files: 
> [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet,
>  
> file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet]
>  ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction 
> request available at 007 to run compaction {code}
>  
> Probably the same cause as HUDI-2108
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2529:
-
Description: 
{code:java}
2021-09-30T16:45:30.4276182Z 12557 [pool-15-thread-2] ERROR 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
running preferred function. Trying secondary
2021-09-30T16:45:30.4276903Z org.apache.hudi.exception.HoodieRemoteException: 
Connect to 0.0.0.0:46865 [/0.0.0.0] failed: Connection refused (Connection 
refused)
2021-09-30T16:45:30.4277581Zat 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:297)
2021-09-30T16:45:30.4278221Zat 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
2021-09-30T16:45:30.4278827Zat 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestFileSlice(PriorityBasedFileSystemView.java:252)
2021-09-30T16:45:30.4279399Zat 
org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:135)
2021-09-30T16:45:30.4279873Zat 
org.apache.hudi.io.HoodieAppendHandle.write(HoodieAppendHandle.java:390)
2021-09-30T16:45:30.4280347Zat 
org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:215)
2021-09-30T16:45:30.4280863Zat 
org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96)
2021-09-30T16:45:30.4281447Zat 
org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40)
2021-09-30T16:45:30.4282039Zat 
org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
2021-09-30T16:45:30.4282624Zat 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
2021-09-30T16:45:30.4283129Zat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-09-30T16:45:30.4283590Zat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-09-30T16:45:30.4284080Zat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-09-30T16:45:30.4284502Zat java.lang.Thread.run(Thread.java:748)
2021-09-30T16:45:30.4298786Z Caused by: 
org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:46865 
[/0.0.0.0] failed: Connection refused (Connection refused)
2021-09-30T16:45:30.4299596Zat 
org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
2021-09-30T16:45:30.4300229Zat 
org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
2021-09-30T16:45:30.4300808Zat 
org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
2021-09-30T16:45:30.4301322Zat 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
2021-09-30T16:45:30.4301804Zat 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
2021-09-30T16:45:30.4302279Zat 
org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
2021-09-30T16:45:30.4302751Zat 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
2021-09-30T16:45:30.4303239Zat 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
2021-09-30T16:45:30.4303940Zat 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
2021-09-30T16:45:30.4304463Zat 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
2021-09-30T16:45:30.4304983Zat 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
2021-09-30T16:45:30.4305450Zat 
org.apache.http.client.fluent.Request.execute(Request.java:151)
2021-09-30T16:45:30.4306006Zat 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:172)
2021-09-30T16:45:30.4306671Zat 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:293)
2021-09-30T16:45:30.4307194Z... 13 more
2021-09-30T16:45:30.4307537Z Caused by: java.net.ConnectException: Connection 
refused (Connection refused)
2021-09-30T16:45:30.4307945Zat 
java.net.PlainSocketImpl.socketConnect(Native Method)
2021-09-30T16:45:30.4308362Zat 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
2021-09-30T16:45:30.4315903Zat 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
2021-09-30T16:45:30.4316643Zat 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
2021-09-30T16:45:30.4317099Zat 
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
2021-09-30T16:45:30.4317496Zat 

[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2077:
-
Description: 
{code:java}
 [INFO] Results:
[ERROR] Errors:
[ERROR]   TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution{code}
 
{code:java}
2021-10-01T15:38:36.7776781Z [ERROR] 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters
 Time elapsed: 57.945 s <<< ERROR! 2021-10-01T15:38:36.7778593Z 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7780175Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.runJobsInParallel(TestHoodieDeltaStreamer.java:926)
 2021-10-01T15:38:36.7781191Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:818)
 2021-10-01T15:38:36.7782459Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:703)
 2021-10-01T15:38:36.7783719Z Caused by: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7784928Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:923)
 2021-10-01T15:38:36.7786069Z Caused by: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7787955Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921)
 2021-10-01T15:38:36.7789094Z Caused by: 
org.apache.hadoop.fs.FileAlreadyExistsException: 2021-10-01T15:38:36.7789863Z 
/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 
127.0.0.1 already exists 2021-10-01T15:38:36.7790732Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
 2021-10-01T15:38:36.7791637Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
 2021-10-01T15:38:36.7793026Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
 2021-10-01T15:38:36.7794034Z at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624)
 2021-10-01T15:38:36.7795041Z at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
 2021-10-01T15:38:36.7796077Z at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 2021-10-01T15:38:36.7797974Z at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 2021-10-01T15:38:36.7798852Z at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) 
2021-10-01T15:38:36.7799527Z at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) 
2021-10-01T15:38:36.7800188Z at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) 
2021-10-01T15:38:36.7800789Z at 
java.security.AccessController.doPrivileged(Native Method) 
2021-10-01T15:38:36.7801386Z at 
javax.security.auth.Subject.doAs(Subject.java:422) 2021-10-01T15:38:36.7802258Z 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
 2021-10-01T15:38:36.7802948Z at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) 
2021-10-01T15:38:36.7803676Z 2021-10-01T15:38:36.7804333Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921)
 2021-10-01T15:38:36.7805070Z Caused by: org.apache.hadoop.ipc.RemoteException: 
2021-10-01T15:38:36.7805712Z 
/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 
127.0.0.1 already exists 2021-10-01T15:38:36.7806633Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
 2021-10-01T15:38:36.7807422Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
 2021-10-01T15:38:36.7808170Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
 2021-10-01T15:38:36.7808949Z at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624)
 2021-10-01T15:38:36.7809836Z at 

[jira] [Closed] (HUDI-2075) Flaky test: TestRowDataToHoodieFunction

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2075.

Resolution: Cannot Reproduce

Don't see this in Azure. Re-open if it comes back.

> Flaky test: TestRowDataToHoodieFunction
> ---
>
> Key: HUDI-2075
> URL: https://issues.apache.org/jira/browse/HUDI-2075
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> At least 10 occurrences 
> [ERROR] Failures: 
> [ERROR]   TestRowDataToHoodieFunction.testRateLimit:72 should process at 
> least 5 seconds ==> expected:  but was: 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2529:
-
Attachment: 27.txt

> Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88
> --
>
> Key: HUDI-2529
> URL: https://issues.apache.org/jira/browse/HUDI-2529
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
> Attachments: 27.txt
>
>
> {code:java}
> 2021-09-30T16:45:30.4276182Z 12557 [pool-15-thread-2] ERROR 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView  - Got error 
> running preferred function. Trying secondary
> 2021-09-30T16:45:30.4276903Z org.apache.hudi.exception.HoodieRemoteException: 
> Connect to 0.0.0.0:46865 [/0.0.0.0] failed: Connection refused (Connection 
> refused)
> 2021-09-30T16:45:30.4277581Z  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:297)
> 2021-09-30T16:45:30.4278221Z  at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
> 2021-09-30T16:45:30.4278827Z  at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestFileSlice(PriorityBasedFileSystemView.java:252)
> 2021-09-30T16:45:30.4279399Z  at 
> org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:135)
> 2021-09-30T16:45:30.4279873Z  at 
> org.apache.hudi.io.HoodieAppendHandle.write(HoodieAppendHandle.java:390)
> 2021-09-30T16:45:30.4280347Z  at 
> org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:215)
> 2021-09-30T16:45:30.4280863Z  at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96)
> 2021-09-30T16:45:30.4281447Z  at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40)
> 2021-09-30T16:45:30.4282039Z  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
> 2021-09-30T16:45:30.4282624Z  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
> 2021-09-30T16:45:30.4283129Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2021-09-30T16:45:30.4283590Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2021-09-30T16:45:30.4284080Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2021-09-30T16:45:30.4284502Z  at java.lang.Thread.run(Thread.java:748)
> 2021-09-30T16:45:30.4298786Z Caused by: 
> org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:46865 
> [/0.0.0.0] failed: Connection refused (Connection refused)
> 2021-09-30T16:45:30.4299596Z  at 
> org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
> 2021-09-30T16:45:30.4300229Z  at 
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
> 2021-09-30T16:45:30.4300808Z  at 
> org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
> 2021-09-30T16:45:30.4301322Z  at 
> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
> 2021-09-30T16:45:30.4301804Z  at 
> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
> 2021-09-30T16:45:30.4302279Z  at 
> org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
> 2021-09-30T16:45:30.4302751Z  at 
> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
> 2021-09-30T16:45:30.4303239Z  at 
> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
> 2021-09-30T16:45:30.4303940Z  at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> 2021-09-30T16:45:30.4304463Z  at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
> 2021-09-30T16:45:30.4304983Z  at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> 2021-09-30T16:45:30.4305450Z  at 
> org.apache.http.client.fluent.Request.execute(Request.java:151)
> 2021-09-30T16:45:30.4306006Z  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:172)
> 2021-09-30T16:45:30.4306671Z  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:293)
> 2021-09-30T16:45:30.4307194Z  ... 13 more
> 2021-09-30T16:45:30.4307537Z Caused by: java.net.ConnectException: Connection 
> refused (Connection refused)
> 2021-09-30T16:45:30.4307945Z  at 
> 

[jira] [Updated] (HUDI-2528) Flaky test: MERGE_ON_READ testTableOperationsWithRestore

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2528:
-
Summary: Flaky test: MERGE_ON_READ testTableOperationsWithRestore  (was: 
Flaky test: [ERROR] HoodieTableType).[2] 
MERGE_ON_READ(testTableOperationsWithRestore)

> Flaky test: MERGE_ON_READ testTableOperationsWithRestore
> 
>
> Key: HUDI-2528
> URL: https://issues.apache.org/jira/browse/HUDI-2528
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
>  
> {code:java}
>  [ERROR] Failures:[ERROR] There files should have been rolled-back when 
> rolling back commit 002 but are still remaining. Files: 
> [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet,
>  
> file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet]
>  ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction 
> request available at 007 to run compaction {code}
>  
> Probably the same cause as HUDI-2108
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2077:
-
Description: 
{code:java}
 [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR]   
TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940
 » Execution{code}
 Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for 
details.

  was:
{code:java}
 [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR]   
TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940
 » Execution{code}
 
{code:java}
2021-10-01T15:38:36.7776781Z [ERROR] 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters
 Time elapsed: 57.945 s <<< ERROR! 2021-10-01T15:38:36.7778593Z 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7780175Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.runJobsInParallel(TestHoodieDeltaStreamer.java:926)
 2021-10-01T15:38:36.7781191Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:818)
 2021-10-01T15:38:36.7782459Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:703)
 2021-10-01T15:38:36.7783719Z Caused by: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7784928Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:923)
 2021-10-01T15:38:36.7786069Z Caused by: 
org.apache.hudi.exception.HoodieIOException: Failed to create file 
hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit
 2021-10-01T15:38:36.7787955Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921)
 2021-10-01T15:38:36.7789094Z Caused by: 
org.apache.hadoop.fs.FileAlreadyExistsException: 2021-10-01T15:38:36.7789863Z 
/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 
127.0.0.1 already exists 2021-10-01T15:38:36.7790732Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
 2021-10-01T15:38:36.7791637Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
 2021-10-01T15:38:36.7793026Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
 2021-10-01T15:38:36.7794034Z at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624)
 2021-10-01T15:38:36.7795041Z at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
 2021-10-01T15:38:36.7796077Z at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 2021-10-01T15:38:36.7797974Z at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 2021-10-01T15:38:36.7798852Z at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) 
2021-10-01T15:38:36.7799527Z at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) 
2021-10-01T15:38:36.7800188Z at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) 
2021-10-01T15:38:36.7800789Z at 
java.security.AccessController.doPrivileged(Native Method) 
2021-10-01T15:38:36.7801386Z at 
javax.security.auth.Subject.doAs(Subject.java:422) 2021-10-01T15:38:36.7802258Z 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
 2021-10-01T15:38:36.7802948Z at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) 
2021-10-01T15:38:36.7803676Z 2021-10-01T15:38:36.7804333Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921)
 2021-10-01T15:38:36.7805070Z Caused by: org.apache.hadoop.ipc.RemoteException: 
2021-10-01T15:38:36.7805712Z 
/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 
127.0.0.1 already exists 2021-10-01T15:38:36.7806633Z at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
 2021-10-01T15:38:36.7807422Z at 

[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2077:
-
Attachment: 28.txt

> Flaky test: TestHoodieDeltaStreamer
> ---
>
> Key: HUDI-2077
> URL: https://issues.apache.org/jira/browse/HUDI-2077
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Major
> Attachments: 28.txt
>
>
> {code:java}
>  [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR]   
> TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940
>  » Execution{code}
>  Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for 
> details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2528) Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore

2021-10-05 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2528:


 Summary: Flaky test: [ERROR] HoodieTableType).[2] 
MERGE_ON_READ(testTableOperationsWithRestore
 Key: HUDI-2528
 URL: https://issues.apache.org/jira/browse/HUDI-2528
 Project: Apache Hudi
  Issue Type: Bug
  Components: Testing
Reporter: Raymond Xu


 
{code:java}
 [ERROR] Failures:[ERROR] There files should have been rolled-back when 
rolling back commit 002 but are still remaining. Files: 
[file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet,
 
file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet]
 ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction 
request available at 007 to run compaction {code}
 

Probably the same cause as HUDI-2108
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1248) [UMBRELLA] Tests cleanup and fixes

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1248:
-
Priority: Critical  (was: Major)

> [UMBRELLA] Tests cleanup and fixes
> --
>
> Key: HUDI-1248
> URL: https://issues.apache.org/jira/browse/HUDI-1248
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, pull-request-available
>
> There are quite a few tickets that require fixes to tests. Creating this 
> umbrella ticket to track all these efforts.
>  
> https://issues.apache.org/jira/browse/HUDI-1055 remove .parquet from tests.
>  https://issues.apache.org/jira/browse/HUDI-1033 ITTestRepairsCommand and 
> TestRepairsCommand
>  https://issues.apache.org/jira/browse/HUDI-1010 memory leak.
>  https://issues.apache.org/jira/browse/HUDI-997 memory leak
>  https://issues.apache.org/jira/browse/HUDI-664 : Adjust Logging levels to 
> reduce verbose log msgs in hudi-client
>  https://issues.apache.org/jira/browse/HUDI-623: Remove 
> UpgradePayloadFromUberToApache
>  https://issues.apache.org/jira/browse/HUDI-541: Replace variables/comments 
> named "data files" to "base file"
>  https://issues.apache.org/jira/browse/HUDI-347: Fix 
> TestHoodieClientOnCopyOnWriteStorage Tests with modular private methods
>  https://issues.apache.org/jira/browse/HUDI-323: Docker demo/integ-test 
> stdout/stderr output only available on process exit
>  https://issues.apache.org/jira/browse/HUDI-284: Need Tests for Hudi handling 
> of schema evolution
>  https://issues.apache.org/jira/browse/HUDI-154: Enable Rollback case in 
> HoodieRealtimeRecordReaderTest.testReader
> https://issues.apache.org/jira/browse/HUDI-1143 timestamp micros. 
> https://issues.apache.org/jira/browse/HUDI-1989: flaky tests in 
> TestHoodieMergeOnReadTable



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2527:


Assignee: Raymond Xu

> Flaky test: 
> TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
> -
>
> Key: HUDI-2527
> URL: https://issues.apache.org/jira/browse/HUDI-2527
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict

2021-10-05 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2527:


 Summary: Flaky test: 
TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
 Key: HUDI-2527
 URL: https://issues.apache.org/jira/browse/HUDI-2527
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Testing
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88

2021-10-05 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2529:


 Summary: Flaky test: 
ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88
 Key: HUDI-2529
 URL: https://issues.apache.org/jira/browse/HUDI-2529
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Testing
Reporter: Raymond Xu


https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2474=logs=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de=30b5aae4-0ea0-5566-42d0-febf71a7061a=114962



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-2076) Flaky test: TestHoodieMultiTableDeltaStreamer

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2076.

Resolution: Cannot Reproduce

Don't see this in Azure. Re-open if it comes back.

> Flaky test: TestHoodieMultiTableDeltaStreamer
> -
>
> Key: HUDI-2076
> URL: https://issues.apache.org/jira/browse/HUDI-2076
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Major
>
> At least 4 occurrences
> [ERROR] Failures: 
> [ERROR]   
> TestHoodieMultiTableDeltaStreamer.testMultiTableExecutionWithKafkaSource:168 
> expected:  but was: 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-2078) Flaky test: TestCleaner

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2078.

Resolution: Cannot Reproduce

Don't see this in Azure. Re-open if it comes back.

> Flaky test: TestCleaner
> ---
>
> Key: HUDI-2078
> URL: https://issues.apache.org/jira/browse/HUDI-2078
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> * TestCleaner.testKeepLatestCommits
>  * TestCleaner.testKeepLatestFileVersions:673 Must clean at least 1 file ==> 
> expected: <2> but was: <1>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2527:
-
Description: 
Test case does not make sense for COW table. Should remove COW from the test 
param.

Consider rewriting the prep logic.
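
A minimal JUnit 5 sketch of what dropping COW from the parameterization could look like, assuming the test is parameterized over HoodieTableType via @EnumSource; the class and nested enum below are illustrative stand-ins, not the actual Hudi test code:

{code:java}
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.EnumSource;

public class MultiWriterTestParamSketch {

  // Stand-in for org.apache.hudi.common.model.HoodieTableType
  enum HoodieTableType { COPY_ON_WRITE, MERGE_ON_READ }

  // Restricting the parameterization to MERGE_ON_READ removes the COW case
  // that the ticket says does not make sense for this scenario.
  @ParameterizedTest
  @EnumSource(value = HoodieTableType.class, names = {"MERGE_ON_READ"})
  void testMultiWriterWithAsyncTableServicesWithConflict(HoodieTableType tableType) {
    // prep logic would go here; the ticket also suggests rewriting it
  }
}
{code}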

> Flaky test: 
> TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
> -
>
> Key: HUDI-2527
> URL: https://issues.apache.org/jira/browse/HUDI-2527
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>
> Test case does not make sense for COW table. Should remove COW from the test 
> param.
> Consider rewriting the prep logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict

2021-10-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2527:
-
Description: 
 
{code:java}
 [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 61.795 
s <<< FAILURE! - in org.apache.hudi.client.TestHoodieClientMultiWriter
[ERROR] 
org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(HoodieTableType)[1]
 Time elapsed: 9.689 s <<< ERROR!java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: org.apache.hudi.exception.HoodieHeartbeatException: 
Unable to generate heartbeat at 
org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(TestHoodieClientMultiWriter.java:227)
Caused by: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate 
heartbeat at 
org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:205)
Caused by: org.apache.hudi.exception.HoodieHeartbeatException: Unable to 
generate heartbeat at 
org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285)
 at 
org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:chmod: cannot 
access '/tmp/junit213441136342269/dataset/.hoodie/.heartbeat/.007.crc': No 
such file or directory       at 
org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285)
 at 
org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202)
 

[ERROR] Errors: 
[ERROR]   
TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:227
 » Execution{code}
 
 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2352=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7

 

Test case does not make sense for COW table. Should remove COW from the test 
param.

Consider rewriting the prep logic.

  was:
Test case does not make sense for COW table. Should remove COW from the test 
param.

Consider rewriting the prep logic.


> Flaky test: 
> TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
> -
>
> Key: HUDI-2527
> URL: https://issues.apache.org/jira/browse/HUDI-2527
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>
>  
> {code:java}
>  [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 61.795 s <<< FAILURE! - in org.apache.hudi.client.TestHoodieClientMultiWriter 
>[ERROR] 
> org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(HoodieTableType)[1]
>  Time elapsed: 9.689 s <<< ERROR!java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate 
> heartbeat at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(TestHoodieClientMultiWriter.java:227)
> Caused by: java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate 
> heartbeat at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:205)
> Caused by: org.apache.hudi.exception.HoodieHeartbeatException: Unable to 
> generate heartbeat at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285)
>  at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202)
> Caused by: org.apache.hadoop.util.Shell$ExitCodeException:chmod: 
> cannot access 
> '/tmp/junit213441136342269/dataset/.hoodie/.heartbeat/.007.crc': No such 
> file or directory       at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285)
>  at 
> org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202)
>  
> [ERROR] Errors: 
> [ERROR]   
> TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:227
>  » Execution{code}
>  
>  
> 

[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-864:

Affects Version/s: 0.9.0

> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, Spark Integration
>Affects Versions: 0.5.2, 0.9.0
>Reporter: Roland Johann
>Priority: Blocker
>  Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: 

[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-864:

Affects Version/s: 0.6.0
   0.5.3
   0.7.0
   0.8.0

> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, Spark Integration
>Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
>Reporter: Roland Johann
>Priority: Blocker
>  Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2390:
-
Labels: sev:critical  (was: )

> KeyGenerator discrepancy between DataFrame writer and SQL
> -
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: renhao
>Priority: Critical
>  Labels: sev:critical
>
> Test Case:
> {code:java}
>  import org.apache.hudi.QuickstartUtils._
>  import scala.collection.JavaConversions._
>  import org.apache.spark.sql.SaveMode._
>  import org.apache.hudi.DataSourceReadOptions._
>  import org.apache.hudi.DataSourceWriteOptions._
>  import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>  
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>  
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')"){code}
> 3. Write data into test2 via the datasource
>  
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
> option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).      
> option(RECORDKEY_FIELD_OPT_KEY, "a").      
> option(PARTITIONPATH_FIELD_OPT_KEY, "b").      
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator"). 
> option(OPERATION_OPT_KEY, "bulk_insert").      
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").      
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").   
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
>       
> option(HIVE_DATABASE_OPT_KEY, "testdb").      
> option(HIVE_TABLE_OPT_KEY, "test2").      
> option(HIVE_USE_JDBC_OPT_KEY, "true").      
> option("hoodie.bulkinsert.shuffle.parallelism", 4).
> option("hoodie.datasource.write.hive_style_partitioning", "true").      
> option(TABLE_NAME, 
> "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
>  
> At this point, the query returns the following:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete one record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query; the record with a=1 was not deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2390:
-
Priority: Critical  (was: Minor)

> KeyGenerator discrepancy between DataFrame writer and SQL
> -
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: renhao
>Priority: Critical
>
> Test Case:
> {code:java}
>  import org.apache.hudi.QuickstartUtils._
>  import scala.collection.JavaConversions._
>  import org.apache.spark.sql.SaveMode._
>  import org.apache.hudi.DataSourceReadOptions._
>  import org.apache.hudi.DataSourceWriteOptions._
>  import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>  
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>  
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')"){code}
> 3. Write data into test2 via the datasource
>  
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
> option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).      
> option(RECORDKEY_FIELD_OPT_KEY, "a").      
> option(PARTITIONPATH_FIELD_OPT_KEY, "b").      
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator"). 
> option(OPERATION_OPT_KEY, "bulk_insert").      
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").      
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").   
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
>       
> option(HIVE_DATABASE_OPT_KEY, "testdb").      
> option(HIVE_TABLE_OPT_KEY, "test2").      
> option(HIVE_USE_JDBC_OPT_KEY, "true").      
> option("hoodie.bulkinsert.shuffle.parallelism", 4).
> option("hoodie.datasource.write.hive_style_partitioning", "true").      
> option(TABLE_NAME, 
> "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
>  
> At this point, the query returns the following:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete one record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query; the record with a=1 was not deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2495:
-
Priority: Critical  (was: Major)

> Difference in behavior between GenericRecord based key gen and Row based key 
> gen 
> -
>
> Key: HUDI-2495
> URL: https://issues.apache.org/jira/browse/HUDI-2495
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: sev:critical
>
> When complex key gen is used and one of the fields in the record key is a 
> timestamp field, the row writer path and the RDD path give different record 
> key values. The GenericRecord path converts the timestamp, whereas the row 
> writer path does not do any conversion.
>  
> import java.sql.Timestamp
>  import spark.implicits._
> val df = Seq(
>  (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
>  (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
>  (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
>  (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
>  ).toDF("typeId","eventTime", "str")
>  
> df.write.format("hudi").
>  option("hoodie.insert.shuffle.parallelism", "2").
>  option("hoodie.upsert.shuffle.parallelism", "2").
>  option("hoodie.bulkinsert.shuffle.parallelism", "2").
>  option("hoodie.datasource.write.precombine.field", "typeId").
>  option("hoodie.datasource.write.partitionpath.field", "typeId").
>  option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>  
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator").
>  option("hoodie.table.name", "hudi_tbl").
>  mode(Overwrite).
>  save("/tmp/hudi_tbl_trial/")
>  
> val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/")
> hudiDF.createOrReplaceTempView("hudi_sql_tbl")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from 
> hudi_sql_tbl").show(false)
>  
> {code:java}
> +--+---+---+--+
> |_hoodie_record_key                |str|eventTime          |typeId|
> +--+---+---+--+
> |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1     |
> |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1     |
> |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2     |
> +--+---+---+--+
> {code}
>  
>  
> // now retry w/ bulk_insert row writer path
> df.write.format("hudi").
>  option("hoodie.insert.shuffle.parallelism", "2").
>  option("hoodie.upsert.shuffle.parallelism", "2").
>  option("hoodie.bulkinsert.shuffle.parallelism", "2").
>  option("hoodie.datasource.write.precombine.field", "typeId").
>  option("hoodie.datasource.write.partitionpath.field", "typeId").
>  option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>  
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator").
>  option("hoodie.table.name", "hudi_tbl").
> "hoodie.datasource.write.operation","bulk_insert").
>  mode(Overwrite).
>  save("/tmp/hudi_tbl_trial_bulk_insert/")
>  
> val hudiDF_bulk_insert = 
> spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/")
> hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from 
> hudi_sql_tbl_bulk_insert").show(false)
> {code:java}
> +---+---+---+--+
> |_hoodie_record_key                     |str|eventTime          |typeId|
> +---+---+---+--+
> |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2     |
> |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1     |
> |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1     |
> +---+---+---+--+
> {code}
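
For illustration, a minimal sketch of the difference described above, assuming the GenericRecord path ends up with an epoch-based long for the timestamp (as the numeric keys in the first output suggest) while the row writer path simply stringifies the java.sql.Timestamp; the names and the millis-to-micros conversion are assumptions, not the actual key generator code:

{code:java}
import java.sql.Timestamp;

public class KeyGenTimestampSketch {
  public static void main(String[] args) {
    Timestamp eventTime = Timestamp.valueOf("2016-05-09 10:12:43");

    // GenericRecord/Avro-style path: the timestamp is converted to an
    // epoch-based long (timezone-dependent), so the record key becomes a number.
    long avroStyle = eventTime.getTime() * 1000; // millis -> micros, assumed unit

    // Row writer path: no conversion, the value is stringified as-is,
    // so the record key keeps the "yyyy-MM-dd HH:mm:ss.S" form.
    String rowStyle = eventTime.toString();

    System.out.println("str:def,eventTime:" + avroStyle);
    System.out.println("str:def,eventTime:" + rowStyle);
  }
}
{code}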



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2495:
-
Component/s: Spark Integration

> Difference in behavior between GenericRecord based key gen and Row based key 
> gen 
> -
>
> Key: HUDI-2495
> URL: https://issues.apache.org/jira/browse/HUDI-2495
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: sev:critical
>
> When complex key gen is used and one of the fields in the record key is a 
> timestamp field, the row writer path and the RDD path give different record 
> key values. The GenericRecord path converts the timestamp, whereas the row 
> writer path does not do any conversion.
>  
> import java.sql.Timestamp
>  import spark.implicits._
> val df = Seq(
>  (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
>  (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
>  (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
>  (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
>  ).toDF("typeId","eventTime", "str")
>  
> df.write.format("hudi").
>  option("hoodie.insert.shuffle.parallelism", "2").
>  option("hoodie.upsert.shuffle.parallelism", "2").
>  option("hoodie.bulkinsert.shuffle.parallelism", "2").
>  option("hoodie.datasource.write.precombine.field", "typeId").
>  option("hoodie.datasource.write.partitionpath.field", "typeId").
>  option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>  
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator").
>  option("hoodie.table.name", "hudi_tbl").
>  mode(Overwrite).
>  save("/tmp/hudi_tbl_trial/")
>  
> val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/")
> hudiDF.createOrReplaceTempView("hudi_sql_tbl")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from 
> hudi_sql_tbl").show(false)
>  
> {code:java}
> +--+---+---+--+
> |_hoodie_record_key                |str|eventTime          |typeId|
> +--+---+---+--+
> |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1     |
> |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1     |
> |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2     |
> +--+---+---+--+
> {code}
>  
>  
> // now retry w/ bulk_insert row writer path
> df.write.format("hudi").
>  option("hoodie.insert.shuffle.parallelism", "2").
>  option("hoodie.upsert.shuffle.parallelism", "2").
>  option("hoodie.bulkinsert.shuffle.parallelism", "2").
>  option("hoodie.datasource.write.precombine.field", "typeId").
>  option("hoodie.datasource.write.partitionpath.field", "typeId").
>  option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>  
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator").
>  option("hoodie.table.name", "hudi_tbl").
> "hoodie.datasource.write.operation","bulk_insert").
>  mode(Overwrite).
>  save("/tmp/hudi_tbl_trial_bulk_insert/")
>  
> val hudiDF_bulk_insert = 
> spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/")
> hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from 
> hudi_sql_tbl_bulk_insert").show(false)
> {code:java}
> +---+---+---+--+
> |_hoodie_record_key                     |str|eventTime          |typeId|
> +---+---+---+--+
> |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2     |
> |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1     |
> |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1     |
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2496:
-
Fix Version/s: 0.10.0

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
> the keys are record keys. So for the 1st batch, duplicates remain intact; but 
> for the 2nd batch, only unique records are considered and later concatenated 
> with the 1st batch.
> https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2496:
-
Component/s: Writer Core

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: sev:critical
>
> Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
> the keys are record keys. So for the 1st batch, duplicates remain intact; but 
> for the 2nd batch, only unique records are considered and later concatenated 
> with the 1st batch.
> https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2496:
-
Labels: sev:critical  (was: writer)

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: sev:critical
>
> Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
> the keys are record keys. So for the 1st batch, duplicates remain intact; but 
> for the 2nd batch, only unique records are considered and later concatenated 
> with the 1st batch.
> https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2496:
-
Description: 
Original GH issue https://github.com/apache/hudi/issues/3709

Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files]

RCA by [~shivnarayan] :

Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
the keys are record keys. So for the 1st batch, duplicates remain intact; but 
for the 2nd batch, only unique records are considered and later concatenated 
with the 1st batch.
 
[https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java

  was:
Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files

RCA by [~shivnarayan] :

Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
the keys are record keys. So for the 1st batch, duplicates remain intact; but 
for the 2nd batch, only unique records are considered and later concatenated 
with the 1st batch.
https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original GH issue https://github.com/apache/hudi/issues/3709
> Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files]
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
> the keys are record keys. So for the 1st batch, duplicates remain intact; but 
> for the 2nd batch, only unique records are considered and later concatenated 
> with the 1st batch.
>  
> [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
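
For illustration, a self-contained sketch of the behavior described in the RCA above (this is not the actual HoodieMergeHandle code; record formats and names are made up): the first batch is written as-is, while the incoming batch is staged in a map keyed by record key, so its duplicates collapse even when dedup is disabled.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeHandleDedupSketch {
  public static void main(String[] args) {
    // 1st batch: written as-is, so duplicate record keys survive.
    List<String> firstBatch = List.of("key1:a", "key1:a", "key2:b");

    // 2nd batch: staged in a map keyed by record key; a duplicate key
    // overwrites the earlier entry, so only unique records remain.
    List<String> secondBatch = List.of("key1:c", "key1:d", "key3:e");
    Map<String, String> keyedIncoming = new HashMap<>();
    for (String record : secondBatch) {
      keyedIncoming.put(record.split(":")[0], record);
    }

    // Merged output = 1st batch (duplicates intact) + unique records
    // from the 2nd batch, matching the observed result.
    List<String> merged = new ArrayList<>(firstBatch);
    merged.addAll(keyedIncoming.values());
    System.out.println(merged); // e.g. [key1:a, key1:a, key2:b, key1:d, key3:e]
  }
}
{code}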



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled

2021-09-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2496:
-
Priority: Critical  (was: Major)

> Inserts are precombined even with dedup disabled
> 
>
> Key: HUDI-2496
> URL: https://issues.apache.org/jira/browse/HUDI-2496
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
> Original GH issue https://github.com/apache/hudi/issues/3709
> Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files]
> RCA by [~shivnarayan] :
> Within HoodieMergeHandle, we use a hashmap to store incoming records, where 
> keys are record keys.
>  and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd 
> batch, only unique records are considered and later concatenated w/ 1st batch.
>  
> [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2608:
-
Description: 
To work with JSON kafka source.

 

Original issue

https://github.com/apache/hudi/issues/3835

> Support JSON schema in schema registry provider
> ---
>
> Key: HUDI-2608
> URL: https://issues.apache.org/jira/browse/HUDI-2608
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Major
>
> To work with JSON kafka source.
>  
> Original issue
> https://github.com/apache/hudi/issues/3835



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2608:
-
Labels: sev:normal user-support-issues  (was: )

> Support JSON schema in schema registry provider
> ---
>
> Key: HUDI-2608
> URL: https://issues.apache.org/jira/browse/HUDI-2608
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Major
>  Labels: sev:normal, user-support-issues
>
> To work with JSON kafka source.
>  
> Original issue
> https://github.com/apache/hudi/issues/3835



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2608:


 Summary: Support JSON schema in schema registry provider
 Key: HUDI-2608
 URL: https://issues.apache.org/jira/browse/HUDI-2608
 Project: Apache Hudi
  Issue Type: New Feature
  Components: DeltaStreamer
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2610) Fix Spark version info for hudi table CTAS from another hudi table

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2610:


 Summary: Fix Spark version info for hudi table CTAS from another 
hudi table
 Key: HUDI-2610
 URL: https://issues.apache.org/jira/browse/HUDI-2610
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Raymond Xu


See details in the original issue

 

https://github.com/apache/hudi/issues/3662#issuecomment-938489457



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2609) Clarify small file configs in config page

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2609:


 Summary: Clarify small file configs in config page
 Key: HUDI-2609
 URL: https://issues.apache.org/jira/browse/HUDI-2609
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Docs
Reporter: Raymond Xu


The knowledge should be preserved in docs close to the related config keys

https://github.com/apache/hudi/issues/3676#issuecomment-922508543



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2609) Clarify small file configs in config page

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2609:
-
Labels: user-support-issues  (was: )

> Clarify small file configs in config page
> -
>
> Key: HUDI-2609
> URL: https://issues.apache.org/jira/browse/HUDI-2609
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Raymond Xu
>Priority: Minor
>  Labels: user-support-issues
>
> The knowledge should be preserved in docs close to the related config keys
> https://github.com/apache/hudi/issues/3676#issuecomment-922508543



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2611) `create table if not exists` should print message instead of throwing error

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2611:


 Summary: `create table if not exists` should print message instead 
of throwing error
 Key: HUDI-2611
 URL: https://issues.apache.org/jira/browse/HUDI-2611
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Raymond Xu


See details in

https://github.com/apache/hudi/issues/3845#issue-1033218877



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2617) Implement HBase Index for Dataset

2021-10-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2617:


 Summary: Implement HBase Index for Dataset
 Key: HUDI-2617
 URL: https://issues.apache.org/jira/browse/HUDI-2617
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Description: End-to-end upsert operation, with proper functional test coverage.

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> End-to-end upsert operation, with proper functional test coverage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2619) Make table services work with Dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2619:
-
Description: Clustering, Compaction, Clean should also work with 
Dataset

> Make table services work with Dataset
> --
>
> Key: HUDI-2619
> URL: https://issues.apache.org/jira/browse/HUDI-2619
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Clustering, Compaction, Clean should also work with Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Status: In Progress  (was: Open)

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> To make use of Dataset APIs in writer paths instead of RDD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2615:


Assignee: Raymond Xu

> Decouple HoodieRecordPayload with Hoodie table, table services, and index
> -
>
> Key: HUDI-2615
> URL: https://issues.apache.org/jira/browse/HUDI-2615
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> HoodieTable, HoodieIndex, and the compaction and clustering services should be 
> independent of HoodieRecordPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2077:
-
Priority: Critical  (was: Major)

> Flaky test: TestHoodieDeltaStreamer
> ---
>
> Key: HUDI-2077
> URL: https://issues.apache.org/jira/browse/HUDI-2077
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Attachments: 28.txt, hudi_2077_schema_mismatch.txt
>
>
> {code:java}
>  [INFO] Results:
>  [INFO]
>  [ERROR] Errors:
>  [ERROR]   
> TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940
>  » Execution{code}
>  Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for 
> details.
> {quote} 
> 1730667 [pool-1461-thread-1] WARN 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer - Got error :
>  org.apache.hudi.exception.HoodieIOException: Could not check if 
> hdfs://localhost:4/user/vsts/continuous_mor_mulitwriter is a valid table 
>  at 
> org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
>  
>  at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:112)
>  
>  at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:73)
>  
>  at 
> org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:606)
>  
>  at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.assertAtleastNDeltaCommitsAfterCommit(TestHoodieDeltaStreamer.java:322)
>  
>  at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$8(TestHoodieDeltaStreamer.java:906)
>  
>  at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.lambda$waitTillCondition$0(TestHoodieDeltaStreamer.java:347)
>  
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
>  at java.lang.Thread.run(Thread.java:748) 
>  Caused by: java.net.ConnectException: Call From fv-az238-328/10.1.0.24 to 
> localhost:4 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see: 
> [http://wiki.apache.org/hadoop/ConnectionRefused]
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Fix Version/s: 0.10.0

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> To make use of Dataset APIs in writer paths instead of RDD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2616:
-
Fix Version/s: 0.10.0

> Implement BloomIndex for Dataset
> -
>
> Key: HUDI-2616
> URL: https://issues.apache.org/jira/browse/HUDI-2616
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2617) Implement HBase Index for Dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2617:
-
Fix Version/s: 0.10.0

> Implement HBase Index for Dataset
> --
>
> Key: HUDI-2617
> URL: https://issues.apache.org/jira/browse/HUDI-2617
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2615:
-
Fix Version/s: 0.10.0

> Decouple HoodieRecordPayload with Hoodie table, table services, and index
> -
>
> Key: HUDI-2615
> URL: https://issues.apache.org/jira/browse/HUDI-2615
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> HoodieTable, HoodieIndex, and the compaction and clustering services should be 
> independent of HoodieRecordPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1869) Upgrading Spark3 To 3.1

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-1869:


Assignee: Yann Byron  (was: pengzhiwei)

> Upgrading Spark3 To 3.1
> ---
>
> Key: HUDI-1869
> URL: https://issues.apache.org/jira/browse/HUDI-1869
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Spark 3.1 has changed the behavior of some internal classes and interfaces in 
> both the spark-sql and spark-core modules.
> Currently Hudi cannot be compiled successfully against Spark 3.1. We need to add 
> SQL support for Spark 3.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2621) Optimize DataFrameWriter on small file handling

2021-10-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2621:


 Summary: Optimize DataFrameWriter on small file handling
 Key: HUDI-2621
 URL: https://issues.apache.org/jira/browse/HUDI-2621
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2623) Make hudi-bot comment at PR thread bottom

2021-10-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2623:


 Summary: Make hudi-bot comment at PR thread bottom
 Key: HUDI-2623
 URL: https://issues.apache.org/jira/browse/HUDI-2623
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Testing
Reporter: Raymond Xu
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2287:
-
Status: In Progress  (was: Open)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files 
> in the ENTIRE dataset s3://somes3bucket) are used for the computation. It seems 
> Spark is reading the entire dataset instead of doing *partition pruning*, and 
> then filtering the dataset based on the where clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 
> files in Hudi) and it takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1706:
-
Priority: Major  (was: Blocker)

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2615:


 Summary: Decouple HoodieRecordPayload with Hoodie table, table 
services, and index
 Key: HUDI-2615
 URL: https://issues.apache.org/jira/browse/HUDI-2615
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu


HoodieTable, HoodieIndex, and the compaction and clustering services should be 
independent of HoodieRecordPayload
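
Purely as an illustration of the direction (none of these type names are the actual 
Hudi API), the decoupling could look like parameterizing the table, index, and 
table-service abstractions by a generic record type instead of HoodieRecordPayload:

{code:scala}
// Illustrative sketch only: a generic record type T replaces the hard dependency
// on HoodieRecordPayload, so Dataset-based records can flow through the same paths.
trait HoodieData[T]          // engine-specific collection: RDD, Dataset, JavaRDD, ...

trait GenericHoodieTable[T] {
  def upsert(records: HoodieData[T]): Unit
}

trait GenericHoodieIndex[T] {
  def tagLocation(records: HoodieData[T], table: GenericHoodieTable[T]): HoodieData[T]
}

trait GenericTableService[T] {
  def compact(table: GenericHoodieTable[T]): Unit
  def cluster(table: GenericHoodieTable[T]): Unit
}
{code}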



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Priority: Blocker  (was: Critical)

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: hudi-umbrellas
>
> To make use of Dataset APIs in writer paths instead of RDD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2616) Implement BloomIndex for Dataset

2021-10-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2616:


 Summary: Implement BloomIndex for Dataset
 Key: HUDI-2616
 URL: https://issues.apache.org/jira/browse/HUDI-2616
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2618:
-
Story Points: 4  (was: 3)

> Implement write operations other than upsert in SparkDataFrameWriteClient
> -
>
> Key: HUDI-2618
> URL: https://issues.apache.org/jira/browse/HUDI-2618
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> insert, insert_prepped, insert_overwrite, insert_overwrite_table, delete, 
> delete_partitions, bulk_insert



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433587#comment-17433587
 ] 

Raymond Xu edited comment on HUDI-1970 at 10/25/21, 7:13 AM:
-

* 1B records (randomized values in the example trip model)
 * 100 partitions, evenly distributed, `year=*/month=*/day=*`, 50 parquet files 
/ partition
 * hudi: 109.8 GB = 22.4 MB parquet x 5000
 * delta: 70.9 GB = 14.5 MB parquet x 5000

|SQL|Hudi 0.9.0|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 
20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' 
and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where 
year='2020' and month='03' and day='01' and fare between 20 and 
50|3.650|3.147|3.086|


was (Author: xushiyan):
* 1B records (randomized values in the example trip model)
 * 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / 
partition
 * EMR 6.2 Spark 3.0.1-amzn-0
 * S3, parquet compression snappy
 * hudi: 109.8 GB = 22.4 MB parquet x 5000
 * delta: 70.9 GB = 14.5 MB parquet x 5000

|SQL|Hudi 0.9.0|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 
20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' 
and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where 
year='2020' and month='03' and day='01' and fare between 20 and 
50|3.650|3.147|3.086|

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1970:
-
Status: Patch Available  (was: In Progress)

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Status: In Progress  (was: Open)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Story Points: 2

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2618:
-
Summary: Implement write operations other than upsert in 
SparkDataFrameWriteClient  (was: Implement operations other than upsert in 
SparkDataFrameWriteClient)

> Implement write operations other than upsert in SparkDataFrameWriteClient
> -
>
> Key: HUDI-2618
> URL: https://issues.apache.org/jira/browse/HUDI-2618
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2618:
-
Description: insert, insert_prepped, insert_overwrite, 
insert_overwrite_table, delete, delete_partitions, bulk_insert
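
For context, the listed operations correspond to the operation values already exposed 
on the datasource write path; a minimal sketch of selecting one today, with `df` and 
`basePath` being illustrative (the new SparkDataFrameWriteClient would need to support 
the same set natively):

{code:scala}
// Illustrative only: the write operation is chosen via hoodie.datasource.write.operation
// on the existing RDD-backed path; this ticket tracks the DataFrame-native equivalents.
df.write.format("hudi").
  option("hoodie.table.name", "h0").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "bulk_insert"). // or insert, upsert, delete, insert_overwrite, ...
  mode("append").
  save(basePath)
{code}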

> Implement write operations other than upsert in SparkDataFrameWriteClient
> -
>
> Key: HUDI-2618
> URL: https://issues.apache.org/jira/browse/HUDI-2618
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> insert, insert_prepped, insert_overwrite, insert_overwrite_table, delete, 
> delete_partitions, bulk_insert



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1885) Support Delete/Update Non-Pk Table

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-1885:


Assignee: Yann Byron

> Support Delete/Update Non-Pk Table
> --
>
> Key: HUDI-1885
> URL: https://issues.apache.org/jira/browse/HUDI-1885
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Allow deleting/updating a non-PK table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2234) MERGE INTO works only ON primary key

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2234:


Assignee: Yann Byron  (was: pengzhiwei)

> MERGE INTO works only ON primary key
> 
>
> Key: HUDI-2234
> URL: https://issues.apache.org/jira/browse/HUDI-2234
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Sagar Sumit
>Assignee: Yann Byron
>Priority: Blocker
> Fix For: 0.10.0
>
>
> {code:sql}
> drop table if exists hudi_gh_ext_fixed;
> create table hudi_gh_ext_fixed (id int, name string, price double, ts long) 
> using hudi options(primaryKey = 'id', precombineField = 'ts') location 
> 'file:///tmp/hudi-h4-fixed';
> insert into hudi_gh_ext_fixed values(3, 'AMZN', 300, 120);
> insert into hudi_gh_ext_fixed values(2, 'UBER', 300, 120);
> insert into hudi_gh_ext_fixed values(4, 'GOOG', 300, 120);
> update hudi_gh_ext_fixed set price = 150.0 where name = 'UBER';
> drop table if exists hudi_fixed;
> create table hudi_fixed (id int, name string, price double, ts long) using 
> hudi options(primaryKey = 'id', precombineField = 'ts') partitioned by (ts) 
> location 'file:///tmp/hudi-h4-part-fixed';
> insert into hudi_fixed values(2, 'UBER', 200, 120);
> MERGE INTO hudi_fixed 
> USING (select id, name, price, ts from hudi_gh_ext_fixed) updates
> ON hudi_fixed.name = updates.name
> WHEN MATCHED THEN
>   UPDATE SET *
> WHEN NOT MATCHED
>   THEN INSERT *;
> -- java.lang.IllegalArgumentException: Merge Key[name] is not Equal to the 
> defined primary key[id] in table hudi_fixed
> --at 
> org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425)
> --at 
> org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:146)
> --at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> --at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> --at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> --at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433586#comment-17433586
 ] 

Raymond Xu commented on HUDI-2287:
--

[~rjkumr] it's likely caused by the `hoodie.table.partition.fields` config in 
your hoodie.properties. As you're using a CustomKeyGenerator, I'm not sure how 
that affects the partition field settings. With a SimpleKeyGenerator, you'd 
expect `hoodie.table.partition.fields=partition1,partition2`. You can modify it 
manually; once it is set correctly for your CustomKeyGenerator's logic, you 
should be able to get partition pruning to work.
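
For illustration, the kind of entry being referred to, in .hoodie/hoodie.properties 
under the table base path (the exact value depends on your key generator's partition 
logic):

{code}
# .hoodie/hoodie.properties under the table base path (illustrative)
hoodie.table.partition.fields=partition1,partition2
{code}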

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files 
> in the ENTIRE dataset s3://somes3bucket) are used for the computation. It seems 
> Spark is reading the entire dataset instead of doing *partition pruning*, and 
> then filtering the dataset based on the where clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 
> files in Hudi) and it takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2287:
-
Priority: Major  (was: Blocker)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files 
> in the ENTIRE dataset s3://somes3bucket) are used for the computation. It seems 
> Spark is reading the entire dataset instead of doing *partition pruning*, and 
> then filtering the dataset based on the where clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 
> files in Hudi) and it takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Parent: HUDI-2531
Issue Type: Sub-task  (was: Improvement)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Summary: Implement SparkDataFrameWriteClient with SimpleIndex  (was: 
Support Dataset write w/o conversion to RDD)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2621) Enhance DataFrameWriter with small file handling

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2621:
-
Summary: Enhance DataFrameWriter with small file handling  (was: Optimize 
DataFrameWriter on small file handling)

> Enhance DataFrameWriter with small file handling
> 
>
> Key: HUDI-2621
> URL: https://issues.apache.org/jira/browse/HUDI-2621
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2287:
-
Priority: Blocker  (was: Major)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files 
> in the ENTIRE dataset s3://somes3bucket) are used for the computation. It seems 
> Spark is reading the entire dataset instead of doing *partition pruning*, and 
> then filtering the dataset based on the where clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 
> files in Hudi) and it takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2615:
-
Status: In Progress  (was: Open)

> Decouple HoodieRecordPayload with Hoodie table, table services, and index
> -
>
> Key: HUDI-2615
> URL: https://issues.apache.org/jira/browse/HUDI-2615
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> HoodieTable, HoodieIndex, and the compaction and clustering services should be 
> independent of HoodieRecordPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2287:
-
Status: Patch Available  (was: In Progress)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files 
> in the ENTIRE dataset s3://somes3bucket) are used for the computation. It seems 
> Spark is reading the entire dataset instead of doing *partition pruning*, and 
> then filtering the dataset based on the where clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 
> files in Hudi) and it takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433587#comment-17433587
 ] 

Raymond Xu commented on HUDI-1970:
--

* 1B records (randomized values in the example trip model)
 * 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / 
partition
 * EMR 6.2 Spark 3.0.1-amzn-0
 * S3, parquet compression snappy
 * hudi: 109.8 GB = 22.4 MB parquet x 5000
 * delta: 70.9 GB = 14.5 MB parquet x 5000

|SQL|Hudi 0.9.0|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 
20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' 
and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where 
year='2020' and month='03' and day='01' and fare between 20 and 
50|3.650|3.147|3.086|
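
For context, the queries above assume the Hudi table was registered as a temp view 
from a snapshot read; a minimal sketch (path and view name are illustrative):

{code:scala}
// Illustrative only: load the table and expose it as the view used in the SQL above.
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("hudi_trips_snapshot")

spark.sql("select count(*) from hudi_trips_snapshot " +
  "where year = '2020' and month = '03' and day = '01'").show()
{code}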

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1970:
-
Status: In Progress  (was: Open)

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Story Points: 3  (was: 2)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

