[jira] [Commented] (HUDI-1307) spark datasource load path format is confused for snapshot and increment read mode
[ https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420358#comment-17420358 ] Raymond Xu commented on HUDI-1307:
--
[~309637554] Any update on this improvement? It would definitely be useful to align on the pattern. In fact, today, to enable HoodieFileIndex, users should avoid passing in a glob path. Is it still necessary to keep the glob path pattern around?

> spark datasource load path format is confused for snapshot and increment read mode
> --
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: liwei
> Assignee: liwei
> Priority: Critical
> Labels: sev:high, user-support-issues
>
> When reading a hudi table via the spark datasource:
> 1. snapshot mode
> {code:java}
> val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*");
> {code}
> The "/*" must be added, otherwise the read fails: org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and without "/*" it will not find the .hoodie and default directories.
> {code:java}
> val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
> {code}
> 2. incremental mode
> Both basePath and basePath + "/*" work, because in org.apache.hudi.DefaultSource, DataSourceUtils.getTablePath supports both formats.
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
>   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>   option(END_INSTANTTIME_OPT_KEY, endTime).
>   load(basePath)
> {code}
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
>   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>   option(END_INSTANTTIME_OPT_KEY, endTime).
>   load(basePath + "/*")
> {code}
> Since incremental mode and snapshot mode do not coincide, users get confused. Having load() take basePath + "/*" or "/*/*" is also confusing. I know this is to support partitions, but I think this API would be clearer for users:
> {code:java}
> partition = "year = '2019'"
> spark.read.format("hudi").load(path).where(partition)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
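The glob-path distinction discussed above can be illustrated without Spark. The sketch below is plain Python (not Hudi code; the partition directory names are hypothetical) showing why `basePath` and `basePath + "/*"` resolve to different path sets when the reader globs, as `fs.globStatus()` does:

```python
# Hedged sketch: a path without a wildcard globs to itself, while "base/*"
# globs to the immediate children (the partition directories).
import glob
import os
import tempfile

def resolve(pattern):
    """Return the paths a glob pattern expands to, like fs.globStatus()."""
    return sorted(glob.glob(pattern))

def demo():
    with tempfile.TemporaryDirectory() as base:
        os.makedirs(os.path.join(base, "year=2019"))  # hypothetical partition dir
        os.makedirs(os.path.join(base, "year=2020"))
        plain = resolve(base)           # just [base]: no partition dirs listed
        starred = resolve(base + "/*")  # the two partition directories
        return len(plain), len(starred)

print(demo())  # (1, 2)
```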
[jira] [Updated] (HUDI-2440) Add dependency change diff script for dependency governance
[ https://issues.apache.org/jira/browse/HUDI-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2440: - Component/s: Usability > Add dependency change diff script for dependency governace > -- > > Key: HUDI-2440 > URL: https://issues.apache.org/jira/browse/HUDI-2440 > Project: Apache Hudi > Issue Type: Improvement > Components: Usability, Utilities >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > > Currently, hudi's dependency management is chaotic, e.g. for > `hudi-spark-bundle_2.11`, the dependency list is here: > {code:java} > HikariCP/2.5.1//HikariCP-2.5.1.jar > ST4/4.0.4//ST4-4.0.4.jar > aircompressor/0.15//aircompressor-0.15.jar > annotations/17.0.0//annotations-17.0.0.jar > ant-launcher/1.9.1//ant-launcher-1.9.1.jar > ant/1.6.5//ant-1.6.5.jar > ant/1.9.1//ant-1.9.1.jar > antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar > aopalliance/1.0//aopalliance-1.0.jar > apache-curator/2.7.1//apache-curator-2.7.1.pom > apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar > apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar > api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar > api-util/1.0.0-M20//api-util-1.0.0-M20.jar > asm/3.1//asm-3.1.jar > avatica-metrics/1.8.0//avatica-metrics-1.8.0.jar > avatica/1.8.0//avatica-1.8.0.jar > avro/1.8.2//avro-1.8.2.jar > bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar > calcite-core/1.10.0//calcite-core-1.10.0.jar > calcite-druid/1.10.0//calcite-druid-1.10.0.jar > calcite-linq4j/1.10.0//calcite-linq4j-1.10.0.jar > commons-beanutils-core/1.8.0//commons-beanutils-core-1.8.0.jar > commons-beanutils/1.7.0//commons-beanutils-1.7.0.jar > commons-cli/1.2//commons-cli-1.2.jar > commons-codec/1.4//commons-codec-1.4.jar > commons-collections/3.2.2//commons-collections-3.2.2.jar > commons-compiler/2.7.6//commons-compiler-2.7.6.jar > commons-compress/1.9//commons-compress-1.9.jar > commons-configuration/1.6//commons-configuration-1.6.jar > 
commons-daemon/1.0.13//commons-daemon-1.0.13.jar > commons-dbcp/1.4//commons-dbcp-1.4.jar > commons-digester/1.8//commons-digester-1.8.jar > commons-el/1.0//commons-el-1.0.jar > commons-httpclient/3.1//commons-httpclient-3.1.jar > commons-io/2.4//commons-io-2.4.jar > commons-lang/2.6//commons-lang-2.6.jar > commons-lang3/3.1//commons-lang3-3.1.jar > commons-logging/1.2//commons-logging-1.2.jar > commons-math/2.2//commons-math-2.2.jar > commons-math3/3.1.1//commons-math3-3.1.1.jar > commons-net/3.1//commons-net-3.1.jar > commons-pool/1.5.4//commons-pool-1.5.4.jar > curator-client/2.7.1//curator-client-2.7.1.jar > curator-framework/2.7.1//curator-framework-2.7.1.jar > curator-recipes/2.7.1//curator-recipes-2.7.1.jar > datanucleus-api-jdo/4.2.4//datanucleus-api-jdo-4.2.4.jar > datanucleus-core/4.1.17//datanucleus-core-4.1.17.jar > datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar > derby/10.10.2.0//derby-10.10.2.0.jar > disruptor/3.3.0//disruptor-3.3.0.jar > dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar > eigenbase-properties/1.1.5//eigenbase-properties-1.1.5.jar > fastutil/7.0.13//fastutil-7.0.13.jar > findbugs-annotations/1.3.9-1//findbugs-annotations-1.3.9-1.jar > fluent-hc/4.4.1//fluent-hc-4.4.1.jar > groovy-all/2.4.4//groovy-all-2.4.4.jar > gson/2.3.1//gson-2.3.1.jar > guava/14.0.1//guava-14.0.1.jar > guice-assistedinject/3.0//guice-assistedinject-3.0.jar > guice-servlet/3.0//guice-servlet-3.0.jar > guice/3.0//guice-3.0.jar > hadoop-annotations/2.7.3//hadoop-annotations-2.7.3.jar > hadoop-auth/2.7.3//hadoop-auth-2.7.3.jar > hadoop-client/2.7.3//hadoop-client-2.7.3.jar > hadoop-common/2.7.3//hadoop-common-2.7.3.jar > hadoop-common/2.7.3/tests/hadoop-common-2.7.3-tests.jar > hadoop-hdfs/2.7.3//hadoop-hdfs-2.7.3.jar > hadoop-hdfs/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar > hadoop-mapreduce-client-app/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar > 
hadoop-mapreduce-client-common/2.7.3//hadoop-mapreduce-client-common-2.7.3.jar > hadoop-mapreduce-client-core/2.7.3//hadoop-mapreduce-client-core-2.7.3.jar > hadoop-mapreduce-client-jobclient/2.7.3//hadoop-mapreduce-client-jobclient-2.7.3.jar > hadoop-mapreduce-client-shuffle/2.7.3//hadoop-mapreduce-client-shuffle-2.7.3.jar > hadoop-yarn-api/2.7.3//hadoop-yarn-api-2.7.3.jar > hadoop-yarn-client/2.7.3//hadoop-yarn-client-2.7.3.jar > hadoop-yarn-common/2.7.3//hadoop-yarn-common-2.7.3.jar > hadoop-yarn-registry/2.7.1//hadoop-yarn-registry-2.7.1.jar > hadoop-yarn-server-applicationhistoryservice/2.7.2//hadoop-yarn-server-applicationhistoryservice-2.7.2.jar > hadoop-yarn-server-common/2.7.2//hadoop-yarn-server-common-2.7.2.jar > hadoop-yarn-server-resourcemanager/2.7.2//hadoop-yarn-server-resourcemanager-2.7.2.jar >
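A sketch of what such a dependency-diff script might look like, in plain Python. The `artifact/version//jar` line format is taken from the list above; everything else (function names, sample data) is an assumption for illustration:

```python
# Hedged sketch: compare two bundle dependency lists and report artifacts that
# were added, removed, or changed version. If an artifact appears twice
# (e.g. two ant versions above), the last entry wins in this simple version.
def parse(lines):
    deps = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split("/")
        deps[parts[0]] = parts[1]  # artifact -> version
    return deps

def diff(old_lines, new_lines):
    old, new = parse(old_lines), parse(new_lines)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(a for a in old.keys() & new.keys() if old[a] != new[a]),
    }

old = ["avro/1.8.2//avro-1.8.2.jar", "guava/14.0.1//guava-14.0.1.jar"]
new = ["avro/1.10.2//avro-1.10.2.jar", "gson/2.3.1//gson-2.3.1.jar"]
print(diff(old, new))
```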
[jira] [Commented] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421584#comment-17421584 ] Raymond Xu commented on HUDI-2496: -- [~helias_an] Sure. assigned! Please ping us even with a draft PR, we can give early feedback. > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Assignee: Helias Antoniou >Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > Original GH issue https://github.com/apache/hudi/issues/3709 > Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files] > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > > [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
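The RCA above can be sketched in plain Python (record shapes are hypothetical; `HoodieMergeHandle` is Java, this only mirrors the map-by-record-key behavior it describes):

```python
# Sketch of the bug mechanism: incoming records are held in a map keyed by
# record key, so duplicates in the incoming (2nd) batch collapse even when
# deduplication is disabled, while 1st-batch duplicates pass through intact.
def merge(existing_batch, incoming_batch):
    incoming_by_key = {}
    for rec in incoming_batch:
        incoming_by_key[rec["key"]] = rec  # a later duplicate overwrites an earlier one
    # existing records are concatenated untouched; only unique incoming records remain
    return existing_batch + list(incoming_by_key.values())

first = [{"key": "k1", "val": 1}, {"key": "k1", "val": 1}]   # duplicates kept
second = [{"key": "k2", "val": 2}, {"key": "k2", "val": 3}]  # collapses to one
print(len(merge(first, second)))  # 3, not 4
```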
[jira] [Assigned] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2496: Assignee: Helias Antoniou > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Assignee: Helias Antoniou >Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > Original GH issue https://github.com/apache/hudi/issues/3709 > Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files] > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > > [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1998) Provide a way to find list of commits through a pythonic API
[ https://issues.apache.org/jira/browse/HUDI-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1998:
-
Description: TimelineUtils is a Java API with which one can get the latest commit or instantiate HoodieActiveTimeline. Users are looking to perform the same through some Python API [https://github.com/apache/hudi/issues/2987] Related issue https://github.com/apache/hudi/issues/3641
was: TimelineUtils is a Java API with which one can get the latest commit or instantiate HoodieActiveTimeline. Users are looking to perform the same through some Python API https://github.com/apache/hudi/issues/2987

> Provide a way to find list of commits through a pythonic API
> -
>
> Key: HUDI-1998
> URL: https://issues.apache.org/jira/browse/HUDI-1998
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Writer Core
> Reporter: Nishith Agarwal
> Priority: Minor
>
> TimelineUtils is a Java API with which one can get the latest commit or instantiate HoodieActiveTimeline. Users are looking to perform the same through some Python API
>
> [https://github.com/apache/hudi/issues/2987]
>
> Related issue
> https://github.com/apache/hudi/issues/3641
-- This message was sent by Atlassian Jira (v8.3.4#803005)
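As a hedged sketch of what such a pythonic API could do (the helper name is made up; the `.hoodie/<instant>.commit` naming follows Hudi's timeline layout), listing completed commits is essentially a directory scan:

```python
# Minimal sketch: scan the .hoodie timeline directory for completed commits.
import os
import tempfile

def list_commits(base_path):
    """Return completed commit instant times, oldest first."""
    timeline = os.path.join(base_path, ".hoodie")
    return sorted(f[: -len(".commit")]
                  for f in os.listdir(timeline) if f.endswith(".commit"))

def demo():
    with tempfile.TemporaryDirectory() as base:
        os.makedirs(os.path.join(base, ".hoodie"))
        for name in ("20210901010101.commit", "20210902020202.commit",
                     "20210903030303.inflight"):  # inflight instants are skipped
            open(os.path.join(base, ".hoodie", name), "w").close()
        return list_commits(base)

print(demo())
```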
[jira] [Created] (HUDI-2500) Spark datasource delete not working on Spark SQL created table
Raymond Xu created HUDI-2500: Summary: Spark datasource delete not working on Spark SQL created table Key: HUDI-2500 URL: https://issues.apache.org/jira/browse/HUDI-2500 Project: Apache Hudi Issue Type: Bug Components: Spark Integration Reporter: Raymond Xu Fix For: 0.10.0 Original issue https://github.com/apache/hudi/issues/3670 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen
[ https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2495:
-
Parent: HUDI-2505
Issue Type: Sub-task (was: Bug)

> Difference in behavior between GenericRecord based key gen and Row based key gen
> -
>
> Key: HUDI-2495
> URL: https://issues.apache.org/jira/browse/HUDI-2495
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: sev:critical
>
> When the complex key generator is used and one of the fields in the record key is a timestamp field, the row writer path and the RDD path give different record key values. The GenericRecord path converts the timestamp, whereas the row writer path does not do any conversion.
>
> import java.sql.Timestamp
> import spark.implicits._
> val df = Seq(
>   (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
>   (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
>   (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
>   (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
> ).toDF("typeId","eventTime", "str")
>
> df.write.format("hudi").
>   option("hoodie.insert.shuffle.parallelism", "2").
>   option("hoodie.upsert.shuffle.parallelism", "2").
>   option("hoodie.bulkinsert.shuffle.parallelism", "2").
>   option("hoodie.datasource.write.precombine.field", "typeId").
>   option("hoodie.datasource.write.partitionpath.field", "typeId").
>   option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
>   option("hoodie.table.name", "hudi_tbl").
>   mode(Overwrite).
> save("/tmp/hudi_tbl_trial/")
>
> val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/")
> hudiDF.createOrReplaceTempView("hudi_sql_tbl")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from hudi_sql_tbl").show(false)
>
> {code:java}
> +------------------------------+---+-------------------+------+
> |_hoodie_record_key            |str|eventTime          |typeId|
> +------------------------------+---+-------------------+------+
> |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1     |
> |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1     |
> |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2     |
> +------------------------------+---+-------------------+------+
> {code}
>
> // now retry w/ bulk_insert row writer path
> df.write.format("hudi").
>   option("hoodie.insert.shuffle.parallelism", "2").
>   option("hoodie.upsert.shuffle.parallelism", "2").
>   option("hoodie.bulkinsert.shuffle.parallelism", "2").
>   option("hoodie.datasource.write.precombine.field", "typeId").
>   option("hoodie.datasource.write.partitionpath.field", "typeId").
>   option("hoodie.datasource.write.recordkey.field", "str,eventTime").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
>   option("hoodie.table.name", "hudi_tbl").
>   option("hoodie.datasource.write.operation", "bulk_insert").
>   mode(Overwrite).
> save("/tmp/hudi_tbl_trial_bulk_insert/")
>
> val hudiDF_bulk_insert = spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/")
> hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert")
> spark.sql("select _hoodie_record_key, str, eventTime, typeId from hudi_sql_tbl_bulk_insert").show(false)
> {code:java}
> +---------------------------------------+---+-------------------+------+
> |_hoodie_record_key                     |str|eventTime          |typeId|
> +---------------------------------------+---+-------------------+------+
> |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2     |
> |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2     |
> |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1     |
> |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1     |
> +---------------------------------------+---+-------------------+------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
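The two key renderings can be mimicked in plain Python (helper names and exact formats are illustrative, not Hudi code): the GenericRecord/Avro path renders the timestamp as an epoch-microseconds long, while the row writer path renders the Timestamp's string form, so the same row yields two different record keys.

```python
# Sketch of the discrepancy between the two key-generation paths.
from datetime import datetime, timezone

ts = datetime(2014, 11, 30, 12, 40, 32, tzinfo=timezone.utc)

def avro_style_key(field, value):
    # timestamp-micros logical type: microseconds since the epoch
    return f"{field}:{int(value.timestamp()) * 1_000_000}"

def row_style_key(field, value):
    # java.sql.Timestamp.toString()-like rendering
    return f"{field}:{value.strftime('%Y-%m-%d %H:%M:%S')}.0"

print(avro_style_key("eventTime", ts))  # numeric form
print(row_style_key("eventTime", ts))   # string form: the keys do not match
```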
[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL
[ https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2390:
-
Parent: HUDI-2505
Issue Type: Sub-task (was: Improvement)

> KeyGenerator discrepancy between DataFrame writer and SQL
> -
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Affects Versions: 0.9.0
> Reporter: renhao
> Priority: Critical
> Labels: sev:critical
>
> Test Case:
> {code:java}
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare data
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')"){code}
> 3. Write data into test2 via the datasource
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
>   option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).
>   option(RECORDKEY_FIELD_OPT_KEY, "a").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "b").
>   option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").
>   option(OPERATION_OPT_KEY, "bulk_insert").
>   option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
>   option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").
>   option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
>   option(HIVE_DATABASE_OPT_KEY, "testdb").
>   option(HIVE_TABLE_OPT_KEY, "test2").
>   option(HIVE_USE_JDBC_OPT_KEY, "true").
>   option("hoodie.bulkinsert.shuffle.parallelism", 4).
>   option("hoodie.datasource.write.hive_style_partitioning", "true").
>   option(TABLE_NAME, "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
> The query result at this point is:
> {code:java}
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  3|  2|
> +---+---+---+{code}
> 4. Delete one record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query again; the record with a=1 has not been deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  3|  2|
> +---+---+---+{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
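One possible mechanism for the missed delete, sketched in plain Python (this is a hypothesis for illustration, not a root cause confirmed by the ticket): if the DataFrame writer and the SQL DELETE path derive record keys with different key generators, the delete's keys never match the stored keys and nothing is removed.

```python
# Hypothetical illustration of a key-generator mismatch causing a no-op delete.
def simple_key(row):
    # SimpleKeyGenerator-style key: the bare field value
    return str(row["a"])

def complex_key(row):
    # ComplexKeyGenerator-style key: "field:value"
    return f"a:{row['a']}"

# table written through the datasource with SimpleKeyGenerator
table = {simple_key({"a": 1}): {"a": 1, "b": "3", "c": "2"}}

def sql_delete(tbl, row, keygen):
    tbl.pop(keygen(row), None)  # a mismatched key silently deletes nothing
    return tbl

sql_delete(table, {"a": 1}, complex_key)
print(len(table))  # the record survives the delete
```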
[jira] [Updated] (HUDI-2505) [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies
[ https://issues.apache.org/jira/browse/HUDI-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2505: - Labels: sev:critical (was: ) > [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies > > > Key: HUDI-2505 > URL: https://issues.apache.org/jira/browse/HUDI-2505 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Raymond Xu >Priority: Critical > Labels: sev:critical > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2500) Spark datasource delete not working on Spark SQL created table
[ https://issues.apache.org/jira/browse/HUDI-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2500: - Parent: HUDI-2505 Issue Type: Sub-task (was: Bug) > Spark datasource delete not working on Spark SQL created table > -- > > Key: HUDI-2500 > URL: https://issues.apache.org/jira/browse/HUDI-2500 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: Raymond Xu >Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > Original issue [https://github.com/apache/hudi/issues/3670] > > Script to re-produce > {code:java} > val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table" > val sparkSourceTableName = "test_spark_table" > val hudiTablePath = s"${tmp.getCanonicalPath}/test_hudi_table" > val hudiTableName = "test_hudi_table" > println("0 - prepare source data") > spark.createDataFrame(Seq( > ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"), > ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"), > ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"), > ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"), > ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"), > ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z") > )).toDF("id", "creation_date", "last_update_time") > .withColumn("creation_date", expr("cast(creation_date as date)")) > .withColumn("id", expr("cast(id as bigint)")) > .write > .option("path", sparkSourceTablePath) > .mode("overwrite") > .format("parquet") > .saveAsTable(sparkSourceTableName) > println("1 - CTAS to load data to Hudi") > val hudiOptions = Map[String, String]( > HoodieWriteConfig.TBL_NAME.key() -> hudiTableName, > DataSourceWriteOptions.TABLE_NAME.key() -> hudiTableName, > DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE", > DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id", > DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> > classOf[ComplexKeyGenerator].getCanonicalName, > 
DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key() -> > classOf[DefaultHoodieRecordPayload].getCanonicalName, > DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "creation_date", > DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "last_update_time", > HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1", > HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1", > HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1", > HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1", > HoodieWriteConfig.DELETE_PARALLELISM_VALUE.key() -> "1" > ) > spark.sql( > s"""create table if not exists $hudiTableName using hudi > | location '$hudiTablePath' > | options ( > | type = 'cow', > | primaryKey = 'id', > | preCombineField = 'last_update_time' > | ) > | partitioned by (creation_date) > | AS > | select id, last_update_time, creation_date from > $sparkSourceTableName > | """.stripMargin) > println("2 - Hudi table has all records") > spark.sql(s"select * from $hudiTableName").show(100) > println("3 - pick 105 to delete") > val rec105 = spark.sql(s"select * from $hudiTableName where id = 105") > rec105.show() > println("4 - issue delete (Spark SQL)") > spark.sql(s"delete from $hudiTableName where id = 105") > println("5 - 105 is deleted") > spark.sql(s"select * from $hudiTableName").show(100) > println("6 - pick 104 to delete") > val rec104 = spark.sql(s"select * from $hudiTableName where id = 104") > rec104.show() > println("7 - issue delete (DataSource)") > rec104.write > .format("hudi") > .options(hudiOptions) > .option(DataSourceWriteOptions.OPERATION.key(), "delete") > .option(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key(), > classOf[EmptyHoodieRecordPayload].getCanonicalName) > .mode(SaveMode.Append) > .save(hudiTablePath) > println("8 - 104 should be deleted") > spark.sql(s"select * from $hudiTableName").show(100) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2505) [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies
Raymond Xu created HUDI-2505: Summary: [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies Key: HUDI-2505 URL: https://issues.apache.org/jira/browse/HUDI-2505 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2500) Spark datasource delete not working on Spark SQL created table
[ https://issues.apache.org/jira/browse/HUDI-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2500: - Description: Original issue [https://github.com/apache/hudi/issues/3670] Script to re-produce {code:java} val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table" val sparkSourceTableName = "test_spark_table" val hudiTablePath = s"${tmp.getCanonicalPath}/test_hudi_table" val hudiTableName = "test_hudi_table" println("0 - prepare source data") spark.createDataFrame(Seq( ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"), ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"), ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"), ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"), ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"), ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z") )).toDF("id", "creation_date", "last_update_time") .withColumn("creation_date", expr("cast(creation_date as date)")) .withColumn("id", expr("cast(id as bigint)")) .write .option("path", sparkSourceTablePath) .mode("overwrite") .format("parquet") .saveAsTable(sparkSourceTableName) println("1 - CTAS to load data to Hudi") val hudiOptions = Map[String, String]( HoodieWriteConfig.TBL_NAME.key() -> hudiTableName, DataSourceWriteOptions.TABLE_NAME.key() -> hudiTableName, DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE", DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "id", DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> classOf[ComplexKeyGenerator].getCanonicalName, DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key() -> classOf[DefaultHoodieRecordPayload].getCanonicalName, DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "creation_date", DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "last_update_time", HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1", HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1", HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1", 
HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1", HoodieWriteConfig.DELETE_PARALLELISM_VALUE.key() -> "1" ) spark.sql( s"""create table if not exists $hudiTableName using hudi | location '$hudiTablePath' | options ( | type = 'cow', | primaryKey = 'id', | preCombineField = 'last_update_time' | ) | partitioned by (creation_date) | AS | select id, last_update_time, creation_date from $sparkSourceTableName | """.stripMargin) println("2 - Hudi table has all records") spark.sql(s"select * from $hudiTableName").show(100) println("3 - pick 105 to delete") val rec105 = spark.sql(s"select * from $hudiTableName where id = 105") rec105.show() println("4 - issue delete (Spark SQL)") spark.sql(s"delete from $hudiTableName where id = 105") println("5 - 105 is deleted") spark.sql(s"select * from $hudiTableName").show(100) println("6 - pick 104 to delete") val rec104 = spark.sql(s"select * from $hudiTableName where id = 104") rec104.show() println("7 - issue delete (DataSource)") rec104.write .format("hudi") .options(hudiOptions) .option(DataSourceWriteOptions.OPERATION.key(), "delete") .option(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.key(), classOf[EmptyHoodieRecordPayload].getCanonicalName) .mode(SaveMode.Append) .save(hudiTablePath) println("8 - 104 should be deleted") spark.sql(s"select * from $hudiTableName").show(100) {code} was:Original issue https://github.com/apache/hudi/issues/3670 > Spark datasource delete not working on Spark SQL created table > -- > > Key: HUDI-2500 > URL: https://issues.apache.org/jira/browse/HUDI-2500 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Raymond Xu >Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > Original issue [https://github.com/apache/hudi/issues/3670] > > Script to re-produce > {code:java} > val sparkSourceTablePath = s"${tmp.getCanonicalPath}/test_spark_table" > val sparkSourceTableName = "test_spark_table" > val hudiTablePath = 
s"${tmp.getCanonicalPath}/test_hudi_table" > val hudiTableName = "test_hudi_table" > println("0 - prepare source data") > spark.createDataFrame(Seq( > ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"), > ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"), > ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"), > ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"), > ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"), > ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z") > )).toDF("id", "creation_date", "last_update_time") > .withColumn("creation_date", expr("cast(creation_date as date)")) >
[jira] [Created] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
Raymond Xu created HUDI-2531: Summary: [UMBRELLA] Support Dataset APIs in writer paths Key: HUDI-2531 URL: https://issues.apache.org/jira/browse/HUDI-2531 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2531: - Labels: hudi-umbrellas sev:critical user-support-issues (was: ) > [UMBRELLA] Support Dataset APIs in writer paths > --- > > Key: HUDI-2531 > URL: https://issues.apache.org/jira/browse/HUDI-2531 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Raymond Xu >Priority: Critical > Labels: hudi-umbrellas, sev:critical, user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2531: - Description: To make use of Dataset APIs in writer paths instead of RDD. > [UMBRELLA] Support Dataset APIs in writer paths > --- > > Key: HUDI-2531 > URL: https://issues.apache.org/jira/browse/HUDI-2531 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Raymond Xu >Priority: Critical > Labels: hudi-umbrellas, sev:critical, user-support-issues > > To make use of Dataset APIs in writer paths instead of RDD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2452) spark on hudi metadata key length < 0
[ https://issues.apache.org/jira/browse/HUDI-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2452:
-
Labels: sev:critical (was: pull-request-available sev:critical)

> spark on hudi metadata key length < 0
> -
>
> Key: HUDI-2452
> URL: https://issues.apache.org/jira/browse/HUDI-2452
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: xy
> Priority: Blocker
> Labels: sev:critical
> Fix For: 0.10.0
>
> Attachments: metadata表.txt
>
> spark on hudi metadata key length <= 0, but the data's primary key is not "" or null; error messages are in the attachment
>
> https://github.com/apache/hudi/issues/3688
-- This message was sent by Atlassian Jira (v8.3.4#803005)
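The ticket does not state the root cause; purely as an illustration of how a key length can ever read as negative, the sketch below (plain Python, not Hudi code) shows a length written as an unsigned 32-bit value but read back as a signed Java-style int flipping sign past 2^31 - 1:

```python
# Hypothetical illustration: a corrupt/overflowed length field reads negative
# when decoded as a signed big-endian int, tripping a "key length < 0" check.
import struct

def read_signed_len(raw):
    return struct.unpack(">i", raw)[0]  # big-endian signed int, like Java's readInt

ok = struct.pack(">I", 42)
too_big = struct.pack(">I", 2**31 + 5)  # exceeds the signed-int range
print(read_signed_len(ok), read_signed_len(too_big))
```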
[jira] [Created] (HUDI-2482) Support drop partitions SQL
Raymond Xu created HUDI-2482:
Summary: Support drop partitions SQL
Key: HUDI-2482
URL: https://issues.apache.org/jira/browse/HUDI-2482
Project: Apache Hudi
Issue Type: Improvement
Components: Spark Integration
Reporter: Yann Byron
Assignee: Yann Byron
Fix For: 0.10.0

Spark SQL supports the following syntax to show a hudi table's partitions.
{code:java}
SHOW PARTITIONS tableIdentifier partitionSpec?{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2482) Support drop partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2482: - Parent: HUDI-1658 Issue Type: Sub-task (was: Improvement) > Support drop partitions SQL > --- > > Key: HUDI-2482 > URL: https://issues.apache.org/jira/browse/HUDI-2482 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: Yann Byron >Priority: Major > Labels: features, pull-request-available > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2456) Support show partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2456:
-
Parent: HUDI-1658
Issue Type: Sub-task (was: Improvement)

> Support show partitions SQL
> ---
>
> Key: HUDI-2456
> URL: https://issues.apache.org/jira/browse/HUDI-2456
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Reporter: Yann Byron
> Assignee: Yann Byron
> Priority: Major
> Labels: features, pull-request-available
> Fix For: 0.10.0
>
> Spark SQL supports the following syntax to show a hudi table's partitions.
> {code:java}
> SHOW PARTITIONS tableIdentifier partitionSpec?{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
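As a hedged sketch of what SHOW PARTITIONS boils down to (plain Python, not Hudi's implementation; names and path shapes are illustrative): list the partition paths under the table location and optionally filter them by a partition spec.

```python
# Sketch: resolve SHOW PARTITIONS [partitionSpec] against known partition paths.
def show_partitions(partition_paths, spec=None):
    """spec is a dict like {"b": "3"}; partition paths look like 'b=3'."""
    if spec is None:
        return sorted(partition_paths)
    wanted = {f"{k}={v}" for k, v in spec.items()}
    return sorted(p for p in partition_paths
                  if wanted.issubset(set(p.split("/"))))

paths = ["b=2", "b=3", "b=4"]
print(show_partitions(paths))              # SHOW PARTITIONS tbl
print(show_partitions(paths, {"b": "3"}))  # SHOW PARTITIONS tbl PARTITION(b='3')
```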
[jira] [Updated] (HUDI-2482) Support drop partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2482: - Description: (was: Spark SQL support the following syntax to show hudi tabls's partitions. {code:java} SHOW PARTITIONS tableIdentifier partitionSpec?{code} ) > Support drop partitions SQL > --- > > Key: HUDI-2482 > URL: https://issues.apache.org/jira/browse/HUDI-2482 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Yann Byron >Assignee: Yann Byron >Priority: Major > Labels: features, pull-request-available > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2482) Support drop partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2482: Assignee: (was: Yann Byron) > Support drop partitions SQL > --- > > Key: HUDI-2482 > URL: https://issues.apache.org/jira/browse/HUDI-2482 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Yann Byron >Priority: Major > Labels: features, pull-request-available > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
[ https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2108: Assignee: Raymond Xu (was: Vinoth Chandar) > Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210 > - > > Key: HUDI-2108 > URL: https://issues.apache.org/jira/browse/HUDI-2108 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Major > > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
[ https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2108: - Status: In Progress (was: Open) > Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210 > - > > Key: HUDI-2108 > URL: https://issues.apache.org/jira/browse/HUDI-2108 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Major > > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2516) Upgrade to Junit 5.8.1
Raymond Xu created HUDI-2516: Summary: Upgrade to Junit 5.8.1 Key: HUDI-2516 URL: https://issues.apache.org/jira/browse/HUDI-2516 Project: Apache Hudi Issue Type: Sub-task Components: Testing Reporter: Raymond Xu Assignee: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
[ https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2108: - Description: org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore flakiness came from {code:java} client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // sometimes never create 007.compaction.requested client.compact(newCommitTime); // then this would fail{code} was: org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore flakiness came from client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // sometimes never create 007.compaction.requested client.compact(newCommitTime); // then this would fail > Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210 > - > > Key: HUDI-2108 > URL: https://issues.apache.org/jira/browse/HUDI-2108 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > > org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore > > flakiness came from > {code:java} > client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // > sometimes never create 007.compaction.requested > client.compact(newCommitTime); // then this would fail{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
[ https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2108: - Component/s: Testing > Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210 > - > > Key: HUDI-2108 > URL: https://issues.apache.org/jira/browse/HUDI-2108 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > > org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore > > flakiness came from > client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // > sometimes never create 007.compaction.requested > client.compact(newCommitTime); // then this would fail -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2108) Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210
[ https://issues.apache.org/jira/browse/HUDI-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2108: - Description: org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore flakiness came from client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // sometimes never create 007.compaction.requested client.compact(newCommitTime); // then this would fail was:https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=357=logs=864947d5-8fca-5138-8394-999ccb212a1e=552b4d2f-26d5-5f2f-1d5d-e8229058b632 > Flaky test: TestHoodieBackedMetadata.testOnlyValidPartitionsAdded:210 > - > > Key: HUDI-2108 > URL: https://issues.apache.org/jira/browse/HUDI-2108 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > > org.apache.hudi.client.functional.TestHoodieBackedMetadata#testTableOperationsWithRestore > > flakiness came from > client.scheduleCompactionAtInstant(newCommitTime, Option.empty()); // > sometimes never create 007.compaction.requested > client.compact(newCommitTime); // then this would fail -- This message was sent by Atlassian Jira (v8.3.4#803005)
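The flakiness described in HUDI-2108 above is a race: `scheduleCompactionAtInstant` sometimes has not yet materialized `007.compaction.requested` when `compact` runs. The race and one possible guard can be sketched in miniature with plain `java.nio.file` (this is an illustrative stand-in, not Hudi's timeline code; the marker name, delays, and class name are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class AwaitMarker {
    // Poll until the marker file exists or the deadline passes.
    static boolean awaitFile(Path p, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (Files.exists(p)) return true;
            Thread.sleep(50); // brief back-off between checks
        }
        return Files.exists(p);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("timeline");
        Path marker = dir.resolve("007.compaction.requested");
        // Simulate the scheduler materializing the marker a little later.
        new Thread(() -> {
            try {
                Thread.sleep(200);
                Files.createFile(marker);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).start();
        // Waiting for the marker before the dependent step removes the race.
        System.out.println(awaitFile(marker, 2000) ? "requested" : "missing");
    }
}
```

A test that asserts immediately after scheduling, without such a wait, fails whenever the marker lands late, which matches the intermittent "No Compaction request available at 007" errors quoted in HUDI-2528.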
[jira] [Updated] (HUDI-2528) Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore
[ https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2528: - Parent: HUDI-1248 Issue Type: Sub-task (was: Bug) > Flaky test: [ERROR] HoodieTableType).[2] > MERGE_ON_READ(testTableOperationsWithRestore > - > > Key: HUDI-2528 > URL: https://issues.apache.org/jira/browse/HUDI-2528 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Priority: Major > > > {code:java} > [ERROR] Failures:[ERROR] There files should have been rolled-back when > rolling back commit 002 but are still remaining. Files: > [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet, > > file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet] > ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction > request available at 007 to run compaction {code} > > Probably the same cause as HUDI-2108 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88
[ https://issues.apache.org/jira/browse/HUDI-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2529: - Description: {code:java} 2021-09-30T16:45:30.4276182Z 12557 [pool-15-thread-2] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary 2021-09-30T16:45:30.4276903Z org.apache.hudi.exception.HoodieRemoteException: Connect to 0.0.0.0:46865 [/0.0.0.0] failed: Connection refused (Connection refused) 2021-09-30T16:45:30.4277581Zat org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:297) 2021-09-30T16:45:30.4278221Zat org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97) 2021-09-30T16:45:30.4278827Zat org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestFileSlice(PriorityBasedFileSystemView.java:252) 2021-09-30T16:45:30.4279399Zat org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:135) 2021-09-30T16:45:30.4279873Zat org.apache.hudi.io.HoodieAppendHandle.write(HoodieAppendHandle.java:390) 2021-09-30T16:45:30.4280347Zat org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:215) 2021-09-30T16:45:30.4280863Zat org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96) 2021-09-30T16:45:30.4281447Zat org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40) 2021-09-30T16:45:30.4282039Zat org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) 2021-09-30T16:45:30.4282624Zat org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) 2021-09-30T16:45:30.4283129Zat java.util.concurrent.FutureTask.run(FutureTask.java:266) 2021-09-30T16:45:30.4283590Zat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2021-09-30T16:45:30.4284080Zat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 2021-09-30T16:45:30.4284502Zat java.lang.Thread.run(Thread.java:748) 2021-09-30T16:45:30.4298786Z Caused by: org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:46865 [/0.0.0.0] failed: Connection refused (Connection refused) 2021-09-30T16:45:30.4299596Zat org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) 2021-09-30T16:45:30.4300229Zat org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) 2021-09-30T16:45:30.4300808Zat org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) 2021-09-30T16:45:30.4301322Zat org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) 2021-09-30T16:45:30.4301804Zat org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) 2021-09-30T16:45:30.4302279Zat org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) 2021-09-30T16:45:30.4302751Zat org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 2021-09-30T16:45:30.4303239Zat org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) 2021-09-30T16:45:30.4303940Zat org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) 2021-09-30T16:45:30.4304463Zat org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) 2021-09-30T16:45:30.4304983Zat org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) 2021-09-30T16:45:30.4305450Zat org.apache.http.client.fluent.Request.execute(Request.java:151) 2021-09-30T16:45:30.4306006Zat org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:172) 
2021-09-30T16:45:30.4306671Zat org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:293) 2021-09-30T16:45:30.4307194Z... 13 more 2021-09-30T16:45:30.4307537Z Caused by: java.net.ConnectException: Connection refused (Connection refused) 2021-09-30T16:45:30.4307945Zat java.net.PlainSocketImpl.socketConnect(Native Method) 2021-09-30T16:45:30.4308362Zat java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) 2021-09-30T16:45:30.4315903Zat java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204) 2021-09-30T16:45:30.4316643Zat java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) 2021-09-30T16:45:30.4317099Zat java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 2021-09-30T16:45:30.4317496Zat
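The first line of the trace above shows `PriorityBasedFileSystemView` logging "Got error running preferred function. Trying secondary": the remote timeline-server view refused the connection, so the view falls back to a secondary source. That preferred/secondary pattern can be sketched generically (a simplified stand-in, not Hudi's implementation; the payloads and messages are invented):

```java
import java.util.function.Supplier;

public class FallbackView {
    // Try the preferred source; on failure log it and fall back to the secondary.
    static <T> T execute(Supplier<T> preferred, Supplier<T> secondary) {
        try {
            return preferred.get();
        } catch (RuntimeException e) {
            System.out.println("preferred failed: " + e.getMessage());
            return secondary.get();
        }
    }

    public static void main(String[] args) {
        String slice = execute(
                () -> { throw new RuntimeException("Connection refused"); },
                () -> "file-slice-from-secondary-view");
        System.out.println(slice);
    }
}
```

This is why the test can still make progress after the refused connection: the error is logged, not propagated, as long as the secondary view succeeds.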
[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2077: - Description: {code:java} [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR] TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution{code} {code:java} 2021-10-01T15:38:36.7776781Z [ERROR] org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters Time elapsed: 57.945 s <<< ERROR! 2021-10-01T15:38:36.7778593Z java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7780175Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.runJobsInParallel(TestHoodieDeltaStreamer.java:926) 2021-10-01T15:38:36.7781191Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:818) 2021-10-01T15:38:36.7782459Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:703) 2021-10-01T15:38:36.7783719Z Caused by: java.lang.RuntimeException: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7784928Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:923) 2021-10-01T15:38:36.7786069Z Caused by: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7787955Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921) 2021-10-01T15:38:36.7789094Z Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 2021-10-01T15:38:36.7789863Z /user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 127.0.0.1 already exists 2021-10-01T15:38:36.7790732Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563) 2021-10-01T15:38:36.7791637Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450) 2021-10-01T15:38:36.7793026Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334) 2021-10-01T15:38:36.7794034Z at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624) 2021-10-01T15:38:36.7795041Z at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397) 2021-10-01T15:38:36.7796077Z at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 2021-10-01T15:38:36.7797974Z at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) 2021-10-01T15:38:36.7798852Z at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) 2021-10-01T15:38:36.7799527Z at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) 2021-10-01T15:38:36.7800188Z at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) 2021-10-01T15:38:36.7800789Z at java.security.AccessController.doPrivileged(Native Method) 2021-10-01T15:38:36.7801386Z at javax.security.auth.Subject.doAs(Subject.java:422) 2021-10-01T15:38:36.7802258Z at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) 2021-10-01T15:38:36.7802948Z at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) 2021-10-01T15:38:36.7803676Z 
2021-10-01T15:38:36.7804333Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921) 2021-10-01T15:38:36.7805070Z Caused by: org.apache.hadoop.ipc.RemoteException: 2021-10-01T15:38:36.7805712Z /user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 127.0.0.1 already exists 2021-10-01T15:38:36.7806633Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563) 2021-10-01T15:38:36.7807422Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450) 2021-10-01T15:38:36.7808170Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334) 2021-10-01T15:38:36.7808949Z at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624) 2021-10-01T15:38:36.7809836Z at
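The root `FileAlreadyExistsException` above comes from two concurrent writers deriving the same instant time and therefore racing to create the same `.commit` file; file creation is atomic, so exactly one wins. The failure mode can be reproduced in miniature with `java.nio.file` (an illustrative sketch, not the DeltaStreamer code; the directory and instant time are taken from the log but the class is made up):

```java
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitFileRace {
    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("hoodie");
        // Both writers derive the same instant time, hence the same file name.
        Path commit = dir.resolve("20211001153821.commit");
        Files.createFile(commit); // first writer wins the atomic create
        try {
            Files.createFile(commit); // second writer races on the same path
        } catch (FileAlreadyExistsException e) {
            System.out.println("second writer lost the race");
        }
    }
}
```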
[jira] [Closed] (HUDI-2075) Flaky test: TestRowDataToHoodieFunction
[ https://issues.apache.org/jira/browse/HUDI-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2075. Resolution: Cannot Reproduce Don't see this in Azure. Re-open if it comes back. > Flaky test: TestRowDataToHoodieFunction > --- > > Key: HUDI-2075 > URL: https://issues.apache.org/jira/browse/HUDI-2075 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Priority: Major > > At least 10 occurrences > [ERROR] Failures: > [ERROR] TestRowDataToHoodieFunction.testRateLimit:72 should process at > least 5 seconds ==> expected: but was: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88
[ https://issues.apache.org/jira/browse/HUDI-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2529: - Attachment: 27.txt > Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88 > -- > > Key: HUDI-2529 > URL: https://issues.apache.org/jira/browse/HUDI-2529 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Priority: Major > Attachments: 27.txt > > > {code:java} > 2021-09-30T16:45:30.4276182Z 12557 [pool-15-thread-2] ERROR > org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error > running preferred function. Trying secondary > 2021-09-30T16:45:30.4276903Z org.apache.hudi.exception.HoodieRemoteException: > Connect to 0.0.0.0:46865 [/0.0.0.0] failed: Connection refused (Connection > refused) > 2021-09-30T16:45:30.4277581Z at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:297) > 2021-09-30T16:45:30.4278221Z at > org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97) > 2021-09-30T16:45:30.4278827Z at > org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestFileSlice(PriorityBasedFileSystemView.java:252) > 2021-09-30T16:45:30.4279399Z at > org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:135) > 2021-09-30T16:45:30.4279873Z at > org.apache.hudi.io.HoodieAppendHandle.write(HoodieAppendHandle.java:390) > 2021-09-30T16:45:30.4280347Z at > org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:215) > 2021-09-30T16:45:30.4280863Z at > org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96) > 2021-09-30T16:45:30.4281447Z at > org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40) > 2021-09-30T16:45:30.4282039Z at > 
org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > 2021-09-30T16:45:30.4282624Z at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121) > 2021-09-30T16:45:30.4283129Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2021-09-30T16:45:30.4283590Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2021-09-30T16:45:30.4284080Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2021-09-30T16:45:30.4284502Z at java.lang.Thread.run(Thread.java:748) > 2021-09-30T16:45:30.4298786Z Caused by: > org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:46865 > [/0.0.0.0] failed: Connection refused (Connection refused) > 2021-09-30T16:45:30.4299596Z at > org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) > 2021-09-30T16:45:30.4300229Z at > org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) > 2021-09-30T16:45:30.4300808Z at > org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) > 2021-09-30T16:45:30.4301322Z at > org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > 2021-09-30T16:45:30.4301804Z at > org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) > 2021-09-30T16:45:30.4302279Z at > org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) > 2021-09-30T16:45:30.4302751Z at > org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) > 2021-09-30T16:45:30.4303239Z at > org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) > 2021-09-30T16:45:30.4303940Z at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > 2021-09-30T16:45:30.4304463Z at > 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) > 2021-09-30T16:45:30.4304983Z at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) > 2021-09-30T16:45:30.4305450Z at > org.apache.http.client.fluent.Request.execute(Request.java:151) > 2021-09-30T16:45:30.4306006Z at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:172) > 2021-09-30T16:45:30.4306671Z at > org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlice(RemoteHoodieTableFileSystemView.java:293) > 2021-09-30T16:45:30.4307194Z ... 13 more > 2021-09-30T16:45:30.4307537Z Caused by: java.net.ConnectException: Connection > refused (Connection refused) > 2021-09-30T16:45:30.4307945Z at >
[jira] [Updated] (HUDI-2528) Flaky test: MERGE_ON_READ testTableOperationsWithRestore
[ https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2528: - Summary: Flaky test: MERGE_ON_READ testTableOperationsWithRestore (was: Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore) > Flaky test: MERGE_ON_READ testTableOperationsWithRestore > > > Key: HUDI-2528 > URL: https://issues.apache.org/jira/browse/HUDI-2528 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Priority: Major > > > {code:java} > [ERROR] Failures:[ERROR] There files should have been rolled-back when > rolling back commit 002 but are still remaining. Files: > [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet, > > file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet] > ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction > request available at 007 to run compaction {code} > > Probably the same cause as HUDI-2108 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2077: - Description: {code:java} [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR] TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution{code} Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for details. was: {code:java} [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR] TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution{code} {code:java} 2021-10-01T15:38:36.7776781Z [ERROR] org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters Time elapsed: 57.945 s <<< ERROR! 2021-10-01T15:38:36.7778593Z java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7780175Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.runJobsInParallel(TestHoodieDeltaStreamer.java:926) 2021-10-01T15:38:36.7781191Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:818) 2021-10-01T15:38:36.7782459Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters(TestHoodieDeltaStreamer.java:703) 2021-10-01T15:38:36.7783719Z Caused by: java.lang.RuntimeException: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7784928Z at 
org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:923) 2021-10-01T15:38:36.7786069Z Caused by: org.apache.hudi.exception.HoodieIOException: Failed to create file hdfs://localhost:46579/user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit 2021-10-01T15:38:36.7787955Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921) 2021-10-01T15:38:36.7789094Z Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 2021-10-01T15:38:36.7789863Z /user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 127.0.0.1 already exists 2021-10-01T15:38:36.7790732Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563) 2021-10-01T15:38:36.7791637Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450) 2021-10-01T15:38:36.7793026Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334) 2021-10-01T15:38:36.7794034Z at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624) 2021-10-01T15:38:36.7795041Z at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397) 2021-10-01T15:38:36.7796077Z at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 2021-10-01T15:38:36.7797974Z at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) 2021-10-01T15:38:36.7798852Z at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) 2021-10-01T15:38:36.7799527Z at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) 2021-10-01T15:38:36.7800188Z at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) 2021-10-01T15:38:36.7800789Z at 
java.security.AccessController.doPrivileged(Native Method) 2021-10-01T15:38:36.7801386Z at javax.security.auth.Subject.doAs(Subject.java:422) 2021-10-01T15:38:36.7802258Z at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) 2021-10-01T15:38:36.7802948Z at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) 2021-10-01T15:38:36.7803676Z 2021-10-01T15:38:36.7804333Z at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$10(TestHoodieDeltaStreamer.java:921) 2021-10-01T15:38:36.7805070Z Caused by: org.apache.hadoop.ipc.RemoteException: 2021-10-01T15:38:36.7805712Z /user/vsts/continuous_mor_mulitwriter/.hoodie/20211001153821.commit for client 127.0.0.1 already exists 2021-10-01T15:38:36.7806633Z at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563) 2021-10-01T15:38:36.7807422Z at
[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2077: - Attachment: 28.txt > Flaky test: TestHoodieDeltaStreamer > --- > > Key: HUDI-2077 > URL: https://issues.apache.org/jira/browse/HUDI-2077 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Sagar Sumit >Priority: Major > Attachments: 28.txt > > > {code:java} > [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR] > TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 > » Execution{code} > Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for > details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2528) Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore
Raymond Xu created HUDI-2528: Summary: Flaky test: [ERROR] HoodieTableType).[2] MERGE_ON_READ(testTableOperationsWithRestore Key: HUDI-2528 URL: https://issues.apache.org/jira/browse/HUDI-2528 Project: Apache Hudi Issue Type: Bug Components: Testing Reporter: Raymond Xu {code:java} [ERROR] Failures:[ERROR] There files should have been rolled-back when rolling back commit 002 but are still remaining. Files: [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet, file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet] ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction request available at 007 to run compaction {code} Probably the same cause as HUDI-2108 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1248) [UMBRELLA] Tests cleanup and fixes
[ https://issues.apache.org/jira/browse/HUDI-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1248: - Priority: Critical (was: Major) > [UMBRELLA] Tests cleanup and fixes > -- > > Key: HUDI-1248 > URL: https://issues.apache.org/jira/browse/HUDI-1248 > Project: Apache Hudi > Issue Type: Improvement > Components: Testing >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Critical > Labels: hudi-umbrellas, pull-request-available > > There are quite few tickets that requires some fixes to tests. Creating this > umbrella ticket to track all efforts. > > https://issues.apache.org/jira/browse/HUDI-1055 remove .parquet from tests. > https://issues.apache.org/jira/browse/HUDI-1033 ITTestRepairsCommand and > TestRepairsCommand > https://issues.apache.org/jira/browse/HUDI-1010 memory leak. > https://issues.apache.org/jira/browse/HUDI-997 memory leak > https://issues.apache.org/jira/browse/HUDI-664 : Adjust Logging levels to > reduce verbose log msgs in hudi-client > https://issues.apache.org/jira/browse/HUDI-623: Remove > UpgradePayloadFromUberToApache > https://issues.apache.org/jira/browse/HUDI-541: Replace variables/comments > named "data files" to "base file" > https://issues.apache.org/jira/browse/HUDI-347: Fix > TestHoodieClientOnCopyOnWriteStorage Tests with modular private methods > https://issues.apache.org/jira/browse/HUDI-323: Docker demo/integ-test > stdout/stderr output only available on process exit > https://issues.apache.org/jira/browse/HUDI-284: Need Tests for Hudi handling > of schema evolution > https://issues.apache.org/jira/browse/HUDI-154: Enable Rollback case in > HoodieRealtimeRecordReaderTest.testReader > https://issues.apache.org/jira/browse/HUDI-1143 timestamp micros. > https://issues.apache.org/jira/browse/HUDI-1989: flaky tests in > TestHoodieMergeOnReadTable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
[ https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2527: Assignee: Raymond Xu > Flaky test: > TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict > - > > Key: HUDI-2527 > URL: https://issues.apache.org/jira/browse/HUDI-2527 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
Raymond Xu created HUDI-2527: Summary: Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict Key: HUDI-2527 URL: https://issues.apache.org/jira/browse/HUDI-2527 Project: Apache Hudi Issue Type: Sub-task Components: Testing Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2529) Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88
Raymond Xu created HUDI-2529: Summary: Flaky test: ITTestHoodieFlinkCompactor.testHoodieFlinkCompactor:88 Key: HUDI-2529 URL: https://issues.apache.org/jira/browse/HUDI-2529 Project: Apache Hudi Issue Type: Sub-task Components: Testing Reporter: Raymond Xu https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2474=logs=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de=30b5aae4-0ea0-5566-42d0-febf71a7061a=114962 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-2076) Flaky test: TestHoodieMultiTableDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2076. Resolution: Cannot Reproduce Don't see this in Azure. Re-open if it comes back. > Flaky test: TestHoodieMultiTableDeltaStreamer > - > > Key: HUDI-2076 > URL: https://issues.apache.org/jira/browse/HUDI-2076 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Major > > At least 4 occurrences > [ERROR] Failures: > [ERROR] > TestHoodieMultiTableDeltaStreamer.testMultiTableExecutionWithKafkaSource:168 > expected: but was: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-2078) Flaky test: TestCleaner
[ https://issues.apache.org/jira/browse/HUDI-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2078. Resolution: Cannot Reproduce Don't see this in Azure. Re-open if it comes back. > Flaky test: TestCleaner > --- > > Key: HUDI-2078 > URL: https://issues.apache.org/jira/browse/HUDI-2078 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Priority: Major > > * TestCleaner.testKeepLatestCommits > * TestCleaner.testKeepLatestFileVersions:673 Must clean at least 1 file ==> > expected: <2> but was: <1> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
[ https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2527: - Description: Test case does not make sense for a COW table. Should remove COW from the test param. Consider rewriting the prep logic. > Flaky test: > TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict > - > > Key: HUDI-2527 > URL: https://issues.apache.org/jira/browse/HUDI-2527 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > > Test case does not make sense for a COW table. Should remove COW from the test > param. > Consider rewriting the prep logic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
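The suggested fix ("remove COW from the test param") could be sketched as follows. This is a hypothetical, self-contained stand-in, not the real change: in TestHoodieClientMultiWriter the restriction would more likely be expressed with JUnit 5's @EnumSource(value = HoodieTableType.class, names = {"MERGE_ON_READ"}). The stand-in enum below mimics Hudi's HoodieTableType for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TableTypeParams {
    // Hypothetical stand-in for org.apache.hudi.common.model.HoodieTableType.
    enum HoodieTableType { COPY_ON_WRITE, MERGE_ON_READ }

    // Filter the parameter set so the conflict test only runs for MERGE_ON_READ,
    // instead of for every table type.
    static List<HoodieTableType> conflictTestParams() {
        return Arrays.stream(HoodieTableType.values())
                .filter(t -> t == HoodieTableType.MERGE_ON_READ)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(conflictTestParams()); // prints [MERGE_ON_READ]
    }
}
```

With the parameter set narrowed this way, the COW variant of the test simply never runs, which is the intent stated in the ticket description.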
[jira] [Updated] (HUDI-2527) Flaky test: TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
[ https://issues.apache.org/jira/browse/HUDI-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2527: - Description: {code:java} [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 61.795 s <<< FAILURE! - in org.apache.hudi.client.TestHoodieClientMultiWriter [ERROR] org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(HoodieTableType)[1] Time elapsed: 9.689 s <<< ERROR!java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate heartbeat at org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(TestHoodieClientMultiWriter.java:227) Caused by: java.lang.RuntimeException: org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate heartbeat at org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:205) Caused by: org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate heartbeat at org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285) at org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202) Caused by: org.apache.hadoop.util.Shell$ExitCodeException:chmod: cannot access '/tmp/junit213441136342269/dataset/.hoodie/.heartbeat/.007.crc': No such file or directory at org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285) at org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202) [ERROR] Errors: [ERROR] TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:227 » Execution{code} 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2352=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7 Test case does not make sense for COW table. Should remove COW from the test param. Consider rewrite the prep logic. was: Test case does not make sense for COW table. Should remove COW from the test param. Consider rewrite the prep logic. > Flaky test: > TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict > - > > Key: HUDI-2527 > URL: https://issues.apache.org/jira/browse/HUDI-2527 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > > > {code:java} > [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 61.795 s <<< FAILURE! - in org.apache.hudi.client.TestHoodieClientMultiWriter >[ERROR] > org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(HoodieTableType)[1] > Time elapsed: 9.689 s <<< ERROR!java.util.concurrent.ExecutionException: > java.lang.RuntimeException: > org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate > heartbeat at > org.apache.hudi.client.TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict(TestHoodieClientMultiWriter.java:227) > Caused by: java.lang.RuntimeException: > org.apache.hudi.exception.HoodieHeartbeatException: Unable to generate > heartbeat at > org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:205) > Caused by: org.apache.hudi.exception.HoodieHeartbeatException: Unable to > generate heartbeat at > org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285) > at > org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202) > Caused 
by: org.apache.hadoop.util.Shell$ExitCodeException:chmod: > cannot access > '/tmp/junit213441136342269/dataset/.hoodie/.heartbeat/.007.crc': No such > file or directory at > org.apache.hudi.client.TestHoodieClientMultiWriter.createCommitWithInserts(TestHoodieClientMultiWriter.java:285) > at > org.apache.hudi.client.TestHoodieClientMultiWriter.lambda$testMultiWriterWithAsyncTableServicesWithConflict$5(TestHoodieClientMultiWriter.java:202) > > [ERROR] Errors: > [ERROR] > TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:227 > » Execution{code} > > >
[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group
[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-864: Affects Version/s: 0.9.0 > parquet schema conflict: optional binary (UTF8) is not a group > --- > > Key: HUDI-864 > URL: https://issues.apache.org/jira/browse/HUDI-864 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core, Spark Integration >Affects Versions: 0.5.2, 0.9.0 >Reporter: Roland Johann >Priority: Blocker > Labels: sev:critical, user-support-issues > > When dealing with struct types like this > {code:json} > { > "type": "struct", > "fields": [ > { > "name": "categoryResults", > "type": { > "type": "array", > "elementType": { > "type": "struct", > "fields": [ > { > "name": "categoryId", > "type": "string", > "nullable": true, > "metadata": {} > } > ] > }, > "containsNull": true > }, > "nullable": true, > "metadata": {} > } > ] > } > {code} > The second ingest batch throws that exception: > {code} > ERROR [Executor task launch worker for task 15] > commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error > upserting bucketType UPDATE for partition :0 > org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieException: operation has failed > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100) > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76) > at > org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271) > at > 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by:
[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group
[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-864: Affects Version/s: 0.6.0 0.5.3 0.7.0 0.8.0 > parquet schema conflict: optional binary (UTF8) is not a group > --- > > Key: HUDI-864 > URL: https://issues.apache.org/jira/browse/HUDI-864 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core, Spark Integration >Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0 >Reporter: Roland Johann >Priority: Blocker > Labels: sev:critical, user-support-issues > > When dealing with struct types like this > {code:json} > { > "type": "struct", > "fields": [ > { > "name": "categoryResults", > "type": { > "type": "array", > "elementType": { > "type": "struct", > "fields": [ > { > "name": "categoryId", > "type": "string", > "nullable": true, > "metadata": {} > } > ] > }, > "containsNull": true > }, > "nullable": true, > "metadata": {} > } > ] > } > {code} > The second ingest batch throws that exception: > {code} > ERROR [Executor task launch worker for task 15] > commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error > upserting bucketType UPDATE for partition :0 > org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieException: operation has failed > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100) > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76) > at > org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271) > at > 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL
[ https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2390: - Labels: sev:critical (was: ) > KeyGenerator discrepancy between DataFrame writer and SQL > - > > Key: HUDI-2390 > URL: https://issues.apache.org/jira/browse/HUDI-2390 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: renhao >Priority: Critical > Labels: sev:critical > > Test Case: > {code:java} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._{code} > 1. Prepare the data > > {code:java} > spark.sql("create table test1(a int,b string,c string) using hudi partitioned > by(b) options(primaryKey='a')") > spark.sql("insert into table test1 select 1,2,3") > {code} > > 2. Create Hudi table test2 > {code:java} > spark.sql("create table test2(a int,b string,c string) using hudi partitioned > by(b) options(primaryKey='a')"){code} > 3. Write data into test2 via the DataSource writer > > {code:java} > val base_data=spark.sql("select * from testdb.test1") > base_data.write.format("hudi"). > option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL). > option(RECORDKEY_FIELD_OPT_KEY, "a"). > option(PARTITIONPATH_FIELD_OPT_KEY, "b"). > option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.SimpleKeyGenerator"). > option(OPERATION_OPT_KEY, "bulk_insert"). > option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). > option(HIVE_PARTITION_FIELDS_OPT_KEY, "b"). > option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor"). > > option(HIVE_DATABASE_OPT_KEY, "testdb"). > option(HIVE_TABLE_OPT_KEY, "test2"). > option(HIVE_USE_JDBC_OPT_KEY, "true"). > option("hoodie.bulkinsert.shuffle.parallelism", 4). > option("hoodie.datasource.write.hive_style_partitioning", "true"). 
> option(TABLE_NAME, > "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2") > {code} > > At this point the query returns: > {code:java} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 3| 2| > +---+---+---+{code} > 4. Delete one record > {code:java} > spark.sql("delete from testdb.test2 where a=1"){code} > 5. Run the query; the record with a=1 was not deleted > {code:java} > spark.sql("select a,b,c from testdb.test2").show{code} > {code:java} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 3| 2| > +---+---+---+{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
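One plausible reading of the failing delete above, given the ticket title: the DataFrame writer stored the records under one key format (SimpleKeyGenerator on field a), while the SQL DELETE path generates keys in a different format, so the delete matches no stored key. A minimal, hypothetical Java sketch of that effect (the key strings here are illustrative, not Hudi's exact output):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyFormatMismatch {
    // Hypothetical table contents, keyed the way the DataFrame writer wrote
    // them (SimpleKeyGenerator: the record key is just the value of field "a").
    static Map<String, String> writtenTable() {
        Map<String, String> table = new HashMap<>();
        table.put("1", "row(a=1, b=3, c=2)");
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table = writtenTable();
        // The SQL path builds a differently formatted key (illustrative),
        // so removing by it is a silent no-op.
        table.remove("a:1");
        System.out.println(table.size()); // the row survives
    }
}
```

If this reading is right, the fix would be to make both paths resolve to the same KeyGenerator configuration rather than to patch either side's key format.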
[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL
[ https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2390: - Priority: Critical (was: Minor) > KeyGenerator discrepancy between DataFrame writer and SQL > - > > Key: HUDI-2390 > URL: https://issues.apache.org/jira/browse/HUDI-2390 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: renhao >Priority: Critical > > Test Case: > {code:java} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._{code} > 1. Prepare the data > > {code:java} > spark.sql("create table test1(a int,b string,c string) using hudi partitioned > by(b) options(primaryKey='a')") > spark.sql("insert into table test1 select 1,2,3") > {code} > > 2. Create Hudi table test2 > {code:java} > spark.sql("create table test2(a int,b string,c string) using hudi partitioned > by(b) options(primaryKey='a')"){code} > 3. Write data into test2 via the DataSource writer > > {code:java} > val base_data=spark.sql("select * from testdb.test1") > base_data.write.format("hudi"). > option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL). > option(RECORDKEY_FIELD_OPT_KEY, "a"). > option(PARTITIONPATH_FIELD_OPT_KEY, "b"). > option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.SimpleKeyGenerator"). > option(OPERATION_OPT_KEY, "bulk_insert"). > option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). > option(HIVE_PARTITION_FIELDS_OPT_KEY, "b"). > option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor"). > > option(HIVE_DATABASE_OPT_KEY, "testdb"). > option(HIVE_TABLE_OPT_KEY, "test2"). > option(HIVE_USE_JDBC_OPT_KEY, "true"). > option("hoodie.bulkinsert.shuffle.parallelism", 4). > option("hoodie.datasource.write.hive_style_partitioning", "true"). 
> option(TABLE_NAME, > "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2") > {code} > > At this point the query returns: > {code:java} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 3| 2| > +---+---+---+{code} > 4. Delete one record > {code:java} > spark.sql("delete from testdb.test2 where a=1"){code} > 5. Run the query; the record with a=1 was not deleted > {code:java} > spark.sql("select a,b,c from testdb.test2").show{code} > {code:java} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 3| 2| > +---+---+---+{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen
[ https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2495: - Priority: Critical (was: Major) > Difference in behavior between GenericRecord based key gen and Row based key > gen > - > > Key: HUDI-2495 > URL: https://issues.apache.org/jira/browse/HUDI-2495 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Labels: sev:critical > > When a complex key gen is used and one of the fields in the record key is a > timestamp field, the row writer path and the rdd path give different record key > values. The GenericRecord path converts the timestamp, whereas the row writer > path does not do any conversion. > > import java.sql.Timestamp > import spark.implicits._ > val df = Seq( > (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"), > (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"), > (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"), > (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def") > ).toDF("typeId","eventTime", "str") > > df.write.format("hudi"). > option("hoodie.insert.shuffle.parallelism", "2"). > option("hoodie.upsert.shuffle.parallelism", "2"). > option("hoodie.bulkinsert.shuffle.parallelism", "2"). > option("hoodie.datasource.write.precombine.field", "typeId"). > option("hoodie.datasource.write.partitionpath.field", "typeId"). > option("hoodie.datasource.write.recordkey.field", "str,eventTime"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator"). > option("hoodie.table.name", "hudi_tbl"). > mode(Overwrite). 
> save("/tmp/hudi_tbl_trial/") > > val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/") > hudiDF.createOrReplaceTempView("hudi_sql_tbl") > spark.sql("select _hoodie_record_key, str, eventTime, typeId from > hudi_sql_tbl").show(false) > > {code:java} > +--+---+---+--+ > |_hoodie_record_key |str|eventTime |typeId| > +--+---+---+--+ > |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1 | > |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1 | > |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2 | > |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2 | > +--+---+---+--+ > {code} > > > // now retry w/ bulk_insert row writer path > df.write.format("hudi"). > option("hoodie.insert.shuffle.parallelism", "2"). > option("hoodie.upsert.shuffle.parallelism", "2"). > option("hoodie.bulkinsert.shuffle.parallelism", "2"). > option("hoodie.datasource.write.precombine.field", "typeId"). > option("hoodie.datasource.write.partitionpath.field", "typeId"). > option("hoodie.datasource.write.recordkey.field", "str,eventTime"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator"). > option("hoodie.table.name", "hudi_tbl"). > option("hoodie.datasource.write.operation","bulk_insert"). > mode(Overwrite). 
> save("/tmp/hudi_tbl_trial_bulk_insert/") > > val hudiDF_bulk_insert = > spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/") > hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert") > spark.sql("select _hoodie_record_key, str, eventTime, typeId from > hudi_sql_tbl_bulk_insert").show(false) > {code:java} > +---+---+---+--+ > |_hoodie_record_key |str|eventTime |typeId| > +---+---+---+--+ > |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2 | > |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2 | > |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1 | > |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1 | > +---+---+---+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
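The discrepancy in the two result tables above can be boiled down to how each path renders the same java.sql.Timestamp. A hypothetical, self-contained Java sketch (not Hudi's actual key generator code): the GenericRecord/Avro path sees the timestamp as an epoch-based long, while the Row path stringifies it, so the same value produces two different record keys.

```java
import java.sql.Timestamp;

public class KeyGenDiscrepancySketch {
    // Avro (GenericRecord) path: the timestamp is a long. Epoch millis is used
    // here for illustration; Hudi's Avro logical type is timestamp-micros,
    // which is why the first table above shows long numeric keys.
    static String avroStyleKey(Timestamp ts) {
        return "eventTime:" + ts.getTime();
    }

    // Row path: no conversion, just Timestamp.toString(), yielding keys like
    // "eventTime:2014-11-30 12:40:32.0" as in the bulk_insert table above.
    static String rowStyleKey(Timestamp ts) {
        return "eventTime:" + ts.toString();
    }

    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2014-11-30 12:40:32");
        System.out.println(avroStyleKey(ts)); // numeric, epoch-based key
        System.out.println(rowStyleKey(ts));  // human-readable string key
    }
}
```

Since record keys determine upsert and delete matching, any table written through one path and updated through the other would silently miss its own records.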
[jira] [Updated] (HUDI-2495) Difference in behavior between GenericRecord based key gen and Row based key gen
[ https://issues.apache.org/jira/browse/HUDI-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2495: - Component/s: Spark Integration > Difference in behavior between GenericRecord based key gen and Row based key > gen > - > > Key: HUDI-2495 > URL: https://issues.apache.org/jira/browse/HUDI-2495 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Labels: sev:critical > > When a complex key gen is used and one of the fields in the record key is a > timestamp field, the row writer path and the rdd path give different record key > values. The GenericRecord path converts the timestamp, whereas the row writer > path does not do any conversion. > > import java.sql.Timestamp > import spark.implicits._ > val df = Seq( > (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"), > (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"), > (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"), > (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def") > ).toDF("typeId","eventTime", "str") > > df.write.format("hudi"). > option("hoodie.insert.shuffle.parallelism", "2"). > option("hoodie.upsert.shuffle.parallelism", "2"). > option("hoodie.bulkinsert.shuffle.parallelism", "2"). > option("hoodie.datasource.write.precombine.field", "typeId"). > option("hoodie.datasource.write.partitionpath.field", "typeId"). > option("hoodie.datasource.write.recordkey.field", "str,eventTime"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator"). > option("hoodie.table.name", "hudi_tbl"). > mode(Overwrite). 
> save("/tmp/hudi_tbl_trial/") > > val hudiDF = spark.read.format("hudi").load("/tmp/hudi_tbl_trial/") > hudiDF.createOrReplaceTempView("hudi_sql_tbl") > spark.sql("select _hoodie_record_key, str, eventTime, typeId from > hudi_sql_tbl").show(false) > > {code:java} > +--+---+---+--+ > |_hoodie_record_key |str|eventTime |typeId| > +--+---+---+--+ > |str:abc,eventTime:141736923200|abc|2014-11-30 12:40:32|1 | > |str:abc,eventTime:138863520100|abc|2014-01-01 23:00:01|1 | > |str:def,eventTime:146280316300|def|2016-05-09 10:12:43|2 | > |str:def,eventTime:148302324000|def|2016-12-29 09:54:00|2 | > +--+---+---+--+ > {code} > > > // now retry w/ bulk_insert row writer path > df.write.format("hudi"). > option("hoodie.insert.shuffle.parallelism", "2"). > option("hoodie.upsert.shuffle.parallelism", "2"). > option("hoodie.bulkinsert.shuffle.parallelism", "2"). > option("hoodie.datasource.write.precombine.field", "typeId"). > option("hoodie.datasource.write.partitionpath.field", "typeId"). > option("hoodie.datasource.write.recordkey.field", "str,eventTime"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator"). > option("hoodie.table.name", "hudi_tbl"). > option("hoodie.datasource.write.operation","bulk_insert"). > mode(Overwrite). 
> save("/tmp/hudi_tbl_trial_bulk_insert/") > > val hudiDF_bulk_insert = > spark.read.format("hudi").load("/tmp/hudi_tbl_trial_bulk_insert/") > hudiDF_bulk_insert.createOrReplaceTempView("hudi_sql_tbl_bulk_insert") > spark.sql("select _hoodie_record_key, str, eventTime, typeId from > hudi_sql_tbl_bulk_insert").show(false) > {code:java} > +---+---+---+--+ > |_hoodie_record_key |str|eventTime |typeId| > +---+---+---+--+ > |str:def,eventTime:2016-05-09 10:12:43.0|def|2016-05-09 10:12:43|2 | > |str:def,eventTime:2016-12-29 09:54:00.0|def|2016-12-29 09:54:00|2 | > |str:abc,eventTime:2014-01-01 23:00:01.0|abc|2014-01-01 23:00:01|1 | > |str:abc,eventTime:2014-11-30 12:40:32.0|abc|2014-11-30 12:40:32|1 | > +---+---+---+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2496: - Fix Version/s: 0.10.0 > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Priority: Major > Labels: sev:critical > Fix For: 0.10.0 > > > Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2496: - Component/s: Writer Core > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Priority: Major > Labels: sev:critical > > Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2496: - Labels: sev:critical (was: writer) > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: sev:critical > > Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2496: - Description: Original GH issue https://github.com/apache/hudi/issues/3709 Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files] RCA by [~shivnarayan] : Within HoodieMergeHandle, we use a hashmap to store incoming records, where keys are record keys. and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd batch, only unique records are considered and later concatenated w/ 1st batch. [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java was: Test case by [~xushiyan] : https://github.com/apache/hudi/pull/3723/files RCA by [~shivnarayan] : Within HoodieMergeHandle, we use a hashmap to store incoming records, where keys are record keys. and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd batch, only unique records are considered and later concatenated w/ 1st batch. https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Priority: Major > Labels: sev:critical > Fix For: 0.10.0 > > > Original GH issue https://github.com/apache/hudi/issues/3709 > Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files] > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. 
> > https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2496) Inserts are precombined even with dedup disabled
[ https://issues.apache.org/jira/browse/HUDI-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2496: - Priority: Critical (was: Major) > Inserts are precombined even with dedup disabled > > > Key: HUDI-2496 > URL: https://issues.apache.org/jira/browse/HUDI-2496 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Sagar Sumit >Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > Original GH issue https://github.com/apache/hudi/issues/3709 > Test case by [~xushiyan] : [https://github.com/apache/hudi/pull/3723/files] > RCA by [~shivnarayan] : > Within HoodieMergeHandle, we use a hashmap to store incoming records, where > keys are record keys. > and so, if you see 1st batch, duplicates would remain intact. but wrt 2nd > batch, only unique records are considered and later concatenated w/ 1st batch. > > [https://github.com/apache/hudi/blob/36be28712196ff4427c41b0aa885c7fcd7356d7f/hudi-[]…]-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
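The RCA quoted in the updates above — HoodieMergeHandle buffering incoming records in a hashmap keyed by record key — can be illustrated with a minimal sketch. This is not Hudi's actual merge-handle code; the class and method names below are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the reported root cause: buffering incoming records
// in a map keyed by record key collapses duplicates in the second batch,
// even when the writer has deduplication (precombine) disabled.
public class MergeHandleSketch {

    // Mimics the internal keyed buffer described in the RCA:
    // a later record with the same key overwrites the earlier one.
    static Map<String, String> bufferByKey(List<String[]> incoming) {
        Map<String, String> keyedRecords = new HashMap<>();
        for (String[] rec : incoming) {
            keyedRecords.put(rec[0], rec[1]); // rec[0] = record key, rec[1] = payload
        }
        return keyedRecords;
    }

    public static void main(String[] args) {
        List<String[]> secondBatch = List.of(
                new String[] {"key-1", "payload-a"},
                new String[] {"key-1", "payload-b"}, // duplicate key: silently dropped
                new String[] {"key-2", "payload-c"});
        // 3 records in, only 2 survive the keyed buffer
        System.out.println(bufferByKey(secondBatch).size());
    }
}
```

This matches the observation in the RCA: duplicates in the first batch are written as-is, but the second batch passes through the keyed buffer during merge, so only unique keys survive before concatenation with the first batch.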
[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider
[ https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2608: - Description: To work with JSON kafka source. Original issue https://github.com/apache/hudi/issues/3835 > Support JSON schema in schema registry provider > --- > > Key: HUDI-2608 > URL: https://issues.apache.org/jira/browse/HUDI-2608 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Raymond Xu >Priority: Major > > To work with JSON kafka source. > > Original issue > https://github.com/apache/hudi/issues/3835 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider
[ https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2608: - Labels: sev:normal user-support-issues (was: ) > Support JSON schema in schema registry provider > --- > > Key: HUDI-2608 > URL: https://issues.apache.org/jira/browse/HUDI-2608 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Raymond Xu >Priority: Major > Labels: sev:normal, user-support-issues > > To work with JSON kafka source. > > Original issue > https://github.com/apache/hudi/issues/3835 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2608) Support JSON schema in schema registry provider
Raymond Xu created HUDI-2608: Summary: Support JSON schema in schema registry provider Key: HUDI-2608 URL: https://issues.apache.org/jira/browse/HUDI-2608 Project: Apache Hudi Issue Type: New Feature Components: DeltaStreamer Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
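As background for the feature requested above: with a Confluent-style schema registry, fetching the latest schema for a Kafka topic's value boils down to a REST call against the subject's latest version. A minimal sketch of building that endpoint follows; the registry host and topic name are placeholders, and this is not the actual Hudi schema-provider code:

```java
// Sketch: the REST endpoint a registry-backed schema provider would query.
// Confluent's default subject naming strategy for a topic's value is
// "<topic>-value"; the JSON-schema variant of the provider would parse the
// returned schema field as JSON schema rather than Avro.
public class RegistryUrlSketch {

    static String latestSchemaUrl(String registryUrl, String topic) {
        return registryUrl + "/subjects/" + topic + "-value/versions/latest";
    }

    public static void main(String[] args) {
        // placeholder host and topic
        System.out.println(latestSchemaUrl("http://localhost:8081", "trips"));
    }
}
```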
[jira] [Created] (HUDI-2610) Fix Spark version info for hudi table CTAS from another hudi table
Raymond Xu created HUDI-2610: Summary: Fix Spark version info for hudi table CTAS from another hudi table Key: HUDI-2610 URL: https://issues.apache.org/jira/browse/HUDI-2610 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Raymond Xu See details in the original issue https://github.com/apache/hudi/issues/3662#issuecomment-938489457 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2609) Clarify small file configs in config page
Raymond Xu created HUDI-2609: Summary: Clarify small file configs in config page Key: HUDI-2609 URL: https://issues.apache.org/jira/browse/HUDI-2609 Project: Apache Hudi Issue Type: Sub-task Components: Docs Reporter: Raymond Xu The knowledge should be preserved in docs close to the related config keys https://github.com/apache/hudi/issues/3676#issuecomment-922508543 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2609) Clarify small file configs in config page
[ https://issues.apache.org/jira/browse/HUDI-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2609: - Labels: user-support-issues (was: ) > Clarify small file configs in config page > - > > Key: HUDI-2609 > URL: https://issues.apache.org/jira/browse/HUDI-2609 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Raymond Xu >Priority: Minor > Labels: user-support-issues > > The knowledge should be preserved in docs close to the related config keys > https://github.com/apache/hudi/issues/3676#issuecomment-922508543 -- This message was sent by Atlassian Jira (v8.3.4#803005)
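The configs this ticket wants clarified are the small-file sizing knobs that also appear later in this digest (HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, HoodieStorageConfig.PARQUET_FILE_MAX_BYTES). A minimal sketch of grouping them as writer option strings; the key names and values below are illustrative and should be checked against the config page this ticket targets:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: small-file handling knobs expressed as writer options.
// Key strings are the commonly documented names; verify against the docs.
public class SmallFileOptionsSketch {

    static Map<String, String> smallFileOptions(long maxFileBytes, long smallFileLimitBytes) {
        Map<String, String> opts = new LinkedHashMap<>();
        // target size for newly written parquet files
        opts.put("hoodie.parquet.max.file.size", Long.toString(maxFileBytes));
        // files below this size are candidates to receive new inserts
        opts.put("hoodie.parquet.small.file.limit", Long.toString(smallFileLimitBytes));
        return opts;
    }

    public static void main(String[] args) {
        // e.g. ~120 MB target files; treat files under ~100 MB as "small"
        System.out.println(smallFileOptions(125829120L, 104857600L));
    }
}
```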
[jira] [Created] (HUDI-2611) `create table if not exists` should print message instead of throwing error
Raymond Xu created HUDI-2611: Summary: `create table if not exists` should print message instead of throwing error Key: HUDI-2611 URL: https://issues.apache.org/jira/browse/HUDI-2611 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Raymond Xu See details in https://github.com/apache/hudi/issues/3845#issue-1033218877 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2617) Implement HBase Index for Dataset
Raymond Xu created HUDI-2617: Summary: Implement HBase Index for Dataset Key: HUDI-2617 URL: https://issues.apache.org/jira/browse/HUDI-2617 Project: Apache Hudi Issue Type: Sub-task Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Description: End to end upsert operation, with proper functional tests coverage. > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > End to end upsert operation, with proper functional tests coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2619) Make table services work with Dataset
[ https://issues.apache.org/jira/browse/HUDI-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2619: - Description: Clustering, Compaction, Clean should also work with Dataset > Make table services work with Dataset > -- > > Key: HUDI-2619 > URL: https://issues.apache.org/jira/browse/HUDI-2619 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > Clustering, Compaction, Clean should also work with Dataset -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2531: - Status: In Progress (was: Open) > [UMBRELLA] Support Dataset APIs in writer paths > --- > > Key: HUDI-2531 > URL: https://issues.apache.org/jira/browse/HUDI-2531 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Labels: hudi-umbrellas > Fix For: 0.10.0 > > > To make use of Dataset APIs in writer paths instead of RDD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
[ https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2615: Assignee: Raymond Xu > Decouple HoodieRecordPayload with Hoodie table, table services, and index > - > > Key: HUDI-2615 > URL: https://issues.apache.org/jira/browse/HUDI-2615 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > HoodieTable, HoodieIndex, and compaction, clustering services should be > independent of HoodieRecordPayload -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2077: - Priority: Critical (was: Major) > Flaky test: TestHoodieDeltaStreamer > --- > > Key: HUDI-2077 > URL: https://issues.apache.org/jira/browse/HUDI-2077 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Sagar Sumit >Priority: Critical > Labels: pull-request-available > Attachments: 28.txt, hudi_2077_schema_mismatch.txt > > > {code:java} > [INFO] Results:8520[INFO] 8521[ERROR] Errors: 8522[ERROR] > TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 > » Execution{code} > Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for > details. > {quote} > 1730667 [pool-1461-thread-1] WARN > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer - Got error : }} > org.apache.hudi.exception.HoodieIOException: Could not check if > hdfs://localhost:4/user/vsts/continuous_mor_mulitwriter is a valid table > at > org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59) > > at > org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:112) > > at > org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:73) > > at > org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:606) > > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.assertAtleastNDeltaCommitsAfterCommit(TestHoodieDeltaStreamer.java:322) > > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$8(TestHoodieDeltaStreamer.java:906) > > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.lambda$waitTillCondition$0(TestHoodieDeltaStreamer.java:347) > > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {{Caused by: java.net.ConnectException: Call From fv-az238-328/10.1.0.24 to > localhost:4 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > [http://wiki.apache.org/hadoop/ConnectionRefused] > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2531: - Fix Version/s: 0.10.0 > [UMBRELLA] Support Dataset APIs in writer paths > --- > > Key: HUDI-2531 > URL: https://issues.apache.org/jira/browse/HUDI-2531 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Labels: hudi-umbrellas > Fix For: 0.10.0 > > > To make use of Dataset APIs in writer paths instead of RDD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset
[ https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2616: - Fix Version/s: 0.10.0 > Implement BloomIndex for Dataset > - > > Key: HUDI-2616 > URL: https://issues.apache.org/jira/browse/HUDI-2616 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Major > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2617) Implement HBase Index for Dataset
[ https://issues.apache.org/jira/browse/HUDI-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2617: - Fix Version/s: 0.10.0 > Implement HBase Index for Dataset > -- > > Key: HUDI-2617 > URL: https://issues.apache.org/jira/browse/HUDI-2617 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
[ https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2615: - Fix Version/s: 0.10.0 > Decouple HoodieRecordPayload with Hoodie table, table services, and index > - > > Key: HUDI-2615 > URL: https://issues.apache.org/jira/browse/HUDI-2615 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > HoodieTable, HoodieIndex, and compaction, clustering services should be > independent of HoodieRecordPayload -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1869) Upgrading Spark3 To 3.1
[ https://issues.apache.org/jira/browse/HUDI-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-1869: Assignee: Yann Byron (was: pengzhiwei) > Upgrading Spark3 To 3.1 > --- > > Key: HUDI-1869 > URL: https://issues.apache.org/jira/browse/HUDI-1869 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: pengzhiwei >Assignee: Yann Byron >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > > Spark 3.1 has changed some behavior of the internal classes and interfaces in > both the spark-sql and spark-core modules. > Currently hudi can't compile successfully under Spark 3.1. We need to add SQL > support for Spark 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2621) Optimize DataFrameWriter on small file handling
Raymond Xu created HUDI-2621: Summary: Optimize DataFrameWriter on small file handling Key: HUDI-2621 URL: https://issues.apache.org/jira/browse/HUDI-2621 Project: Apache Hudi Issue Type: Sub-task Reporter: Raymond Xu Fix For: 0.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2623) Make hudi-bot comment at PR thread bottom
Raymond Xu created HUDI-2623: Summary: Make hudi-bot comment at PR thread bottom Key: HUDI-2623 URL: https://issues.apache.org/jira/browse/HUDI-2623 Project: Apache Hudi Issue Type: Improvement Components: Testing Reporter: Raymond Xu Fix For: 0.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2287: - Status: In Progress (was: Open) > Partition pruning not working on Hudi dataset > - > > Key: HUDI-2287 > URL: https://issues.apache.org/jira/browse/HUDI-2287 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance >Reporter: Rajkumar Gunasekaran >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, we have created a Hudi dataset which has two level partition like this > {code:java} > s3://somes3bucket/partition1=value/partition2=value > {code} > where _partition1_ and _partition2_ is of type string > When running a simple count query using Hudi format in spark-shell, it takes > almost 3 minutes to complete > > {code:scala} > spark.read.format("hudi").load("s3://somes3bucket"). > where("partition1 = 'somevalue' and partition2 = 'somevalue'"). > count() > > res1: Long = > attempt 1: 3.2 minutes > attempt 2: 2.5 minutes > {code} > In the Spark UI ~9000 tasks (which is approximately equivalent to the total > no of files in the ENTIRE dataset s3://somes3bucket) are used for > computation. Seems like spark is reading the entire dataset instead of > *partition pruning.*...and then filtering the dataset based on the where > clause > Whereas, if I use the parquet format to read the dataset, the query only > takes ~30 seconds (vis-a-vis 3 minutes with Hudi format) > {code:scala} > spark.read.parquet("s3://somes3bucket"). > where("partition1 = 'somevalue' and partition2 = 'somevalue'"). > count() > res2: Long = > ~ 30 seconds > {code} > In the spark UI, only 1361 (ie 1361 tasks) files are scanned (vis-a-vis ~9000 > files in Hudi) and takes only 15 seconds > Any idea why partition pruning is not working when using Hudi format? > Wondering if I am missing any configuration during the creation of the > dataset? 
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is > the configuration I have used for creating the dataset > {code:scala} > df.writeStream > .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds")) > .partitionBy("partition1","partition2") > .format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get) > //-- > .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy") > .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, > param.expectedFileSizeInBytes) > .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, > HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES) > //-- > .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, > (param.expectedFileSizeInBytes / 100) * 80) > .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true") > .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, > param.runCompactionAfterNDeltaCommits.get) > //-- > .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id") > .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, > classOf[CustomKeyGenerator].getName) > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, > "partition1:SIMPLE,partition2:SIMPLE") > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, > hudiTablePrecombineKey) > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false") > .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > "partition1,partition2") > .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get) > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, > param.hiveNHudiTableName.get) > 
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > classOf[MultiPartKeysValueExtractor].getName) > .outputMode(OutputMode.Append()) > .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test
[ https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1706: - Priority: Major (was: Blocker) > Test flakiness w/ multiwriter test > -- > > Key: HUDI-1706 > URL: https://issues.apache.org/jira/browse/HUDI-1706 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Major > Fix For: 0.10.0 > > > [https://api.travis-ci.com/v3/job/492130170/log.txt] > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
Raymond Xu created HUDI-2615: Summary: Decouple HoodieRecordPayload with Hoodie table, table services, and index Key: HUDI-2615 URL: https://issues.apache.org/jira/browse/HUDI-2615 Project: Apache Hudi Issue Type: Sub-task Reporter: Raymond Xu HoodieTable, HoodieIndex, and compaction, clustering services should be independent of HoodieRecordPayload -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2531: - Priority: Blocker (was: Critical) > [UMBRELLA] Support Dataset APIs in writer paths > --- > > Key: HUDI-2531 > URL: https://issues.apache.org/jira/browse/HUDI-2531 > Project: Apache Hudi > Issue Type: New Feature > Components: Spark Integration >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Labels: hudi-umbrellas > > To make use of Dataset APIs in writer paths instead of RDD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2616) Implement BloomIndex for Dataset
Raymond Xu created HUDI-2616: Summary: Implement BloomIndex for Dataset Key: HUDI-2616 URL: https://issues.apache.org/jira/browse/HUDI-2616 Project: Apache Hudi Issue Type: Sub-task Reporter: Raymond Xu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient
[ https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2618: - Story Points: 4 (was: 3) > Implement write operations other than upsert in SparkDataFrameWriteClient > - > > Key: HUDI-2618 > URL: https://issues.apache.org/jira/browse/HUDI-2618 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > insert, insert_prepped, insert_overwrite, insert_overwrite_table, delete, > delete_partitions, bulk_insert -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433587#comment-17433587 ] Raymond Xu edited comment on HUDI-1970 at 10/25/21, 7:13 AM: - * 1B records (randomized values in the example trip model) * 100 partitions, evenly distributed, `year=*/month=*/day=*`, 50 parquet files / partition * hudi: 109.8 GB = 22.4 MB parquet x 5000 * delta: 70.9 GB = 14.5 MB parquet x 5000 |SQL|Hudi 0.9.0| |select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914| |select count(*) from hudi_trips_snapshot|96.001|83.839|66.973| |select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767| |select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086| was (Author: xushiyan): * 1B records (randomized values in the example trip model) * 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / partition * EMR 6.2 Spark 3.0.1-amzn-0 * S3, parquet compression snappy * hudi: 109.8 GB = 22.4 MB parquet x 5000 * delta: 70.9 GB = 14.5 MB parquet x 5000 |SQL|Hudi 0.9.0| |select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914| |select count(*) from hudi_trips_snapshot|96.001|83.839|66.973| |select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767| |select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086| > Performance testing/certification of key SQL DMLs > - > > Key: HUDI-1970 > URL: https://issues.apache.org/jira/browse/HUDI-1970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance, Spark Integration >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1970: - Status: Patch Available (was: In Progress) > Performance testing/certification of key SQL DMLs > - > > Key: HUDI-1970 > URL: https://issues.apache.org/jira/browse/HUDI-1970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance, Spark Integration >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Status: In Progress (was: Open) > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Story Points: 2 > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient
[ https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2618: - Summary: Implement write operations other than upsert in SparkDataFrameWriteClient (was: Implement operations other than upsert in SparkDataFrameWriteClient) > Implement write operations other than upsert in SparkDataFrameWriteClient > - > > Key: HUDI-2618 > URL: https://issues.apache.org/jira/browse/HUDI-2618 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2618) Implement write operations other than upsert in SparkDataFrameWriteClient
[ https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2618: - Description: insert, insert_prepped, insert_overwrite, insert_overwrite_table, delete, delete_partitions, bulk_insert > Implement write operations other than upsert in SparkDataFrameWriteClient > - > > Key: HUDI-2618 > URL: https://issues.apache.org/jira/browse/HUDI-2618 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > insert, insert_prepped, insert_overwrite, insert_overwrite_table, delete, > delete_partitions, bulk_insert -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1885) Support Delete/Update Non-Pk Table
[ https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-1885: Assignee: Yann Byron > Support Delete/Update Non-Pk Table > -- > > Key: HUDI-1885 > URL: https://issues.apache.org/jira/browse/HUDI-1885 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: pengzhiwei >Assignee: Yann Byron >Priority: Blocker > Fix For: 0.10.0 > > > Allow to delete/update a non-pk table. > {code:java} > create table h0 ( > id int, > name string, > price double > ) using hudi; > delete from h0 where id = 10; > update h0 set price = 10 where id = 12; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2234) MERGE INTO works only ON primary key
[ https://issues.apache.org/jira/browse/HUDI-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-2234: Assignee: Yann Byron (was: pengzhiwei) > MERGE INTO works only ON primary key > > > Key: HUDI-2234 > URL: https://issues.apache.org/jira/browse/HUDI-2234 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: Sagar Sumit >Assignee: Yann Byron >Priority: Blocker > Fix For: 0.10.0 > > > {code:sql} > drop table if exists hudi_gh_ext_fixed; > create table hudi_gh_ext_fixed (id int, name string, price double, ts long) > using hudi options(primaryKey = 'id', precombineField = 'ts') location > 'file:///tmp/hudi-h4-fixed'; > insert into hudi_gh_ext_fixed values(3, 'AMZN', 300, 120); > insert into hudi_gh_ext_fixed values(2, 'UBER', 300, 120); > insert into hudi_gh_ext_fixed values(4, 'GOOG', 300, 120); > update hudi_gh_ext_fixed set price = 150.0 where name = 'UBER'; > drop table if exists hudi_fixed; > create table hudi_fixed (id int, name string, price double, ts long) using > hudi options(primaryKey = 'id', precombineField = 'ts') partitioned by (ts) > location 'file:///tmp/hudi-h4-part-fixed'; > insert into hudi_fixed values(2, 'UBER', 200, 120); > MERGE INTO hudi_fixed > USING (select id, name, price, ts from hudi_gh_ext_fixed) updates > ON hudi_fixed.name = updates.name > WHEN MATCHED THEN > UPDATE SET * > WHEN NOT MATCHED > THEN INSERT *; > -- java.lang.IllegalArgumentException: Merge Key[name] is not Equal to the > defined primary key[id] in table hudi_fixed > --at > org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425) > --at > org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:146) > --at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > --at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > --at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > --at > org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
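The failing check above can be illustrated with a small Python sketch. This is a hypothetical simplification of the validation in MergeIntoHoodieTableCommand.buildMergeIntoConfig (which is implemented in Scala); the function name and signature are illustrative only:

```python
def check_merge_key(merge_key: str, primary_key: str, table: str) -> None:
    """Reject a MERGE INTO whose ON clause does not join on the table's
    primary key, mirroring the error reported in this issue."""
    if merge_key != primary_key:
        raise ValueError(
            f"Merge Key[{merge_key}] is not Equal to the defined "
            f"primary key[{primary_key}] in table {table}"
        )

# The reported statement joins ON hudi_fixed.name = updates.name, but the
# table's primaryKey option is 'id', so the validation fails:
try:
    check_merge_key("name", "id", "hudi_fixed")
except ValueError as e:
    print(e)

# Joining on the primary key itself passes the check:
check_merge_key("id", "id", "hudi_fixed")
```

In other words, until this sub-task lands, the ON clause must reference the configured `primaryKey` field for the MERGE to be accepted.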
[jira] [Commented] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433586#comment-17433586 ] Raymond Xu commented on HUDI-2287: -- [~rjkumr] this is likely caused by the `hoodie.table.partition.fields` config in your hoodie.properties. As you're using a CustomKeyGenerator, I'm not sure how that affects the partition field settings; with a SimpleKeyGenerator, you'd expect `hoodie.table.partition.fields=partition1,partition2`. If you manually modify it to match your CustomKeyGenerator's logic, you should be able to get partition pruning to work. > Partition pruning not working on Hudi dataset > - > > Key: HUDI-2287 > URL: https://issues.apache.org/jira/browse/HUDI-2287 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance >Reporter: Rajkumar Gunasekaran >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, we have created a Hudi dataset with a two-level partition layout like this > {code:java} > s3://somes3bucket/partition1=value/partition2=value > {code} > where _partition1_ and _partition2_ are of type string. > When running a simple count query using the Hudi format in spark-shell, it takes > almost 3 minutes to complete > > {code:scala} > spark.read.format("hudi").load("s3://somes3bucket"). > where("partition1 = 'somevalue' and partition2 = 'somevalue'"). > count() > > res1: Long = > attempt 1: 3.2 minutes > attempt 2: 2.5 minutes > {code} > In the Spark UI, ~9000 tasks (approximately the total > number of files in the ENTIRE dataset s3://somes3bucket) are used for the > computation.
It seems Spark is reading the entire dataset instead of applying > *partition pruning*, and then filtering the dataset based on the where > clause. > Whereas, if I use the parquet format to read the dataset, the query only > takes ~30 seconds (vis-a-vis 3 minutes with the Hudi format) > {code:scala} > spark.read.parquet("s3://somes3bucket"). > where("partition1 = 'somevalue' and partition2 = 'somevalue'"). > count() > res2: Long = > ~ 30 seconds > {code} > In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 > files with Hudi) and the scan takes only 15 seconds. > Any idea why partition pruning is not working when using the Hudi format? > Wondering if I am missing any configuration during the creation of the > dataset? > PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is > the configuration I have used for creating the dataset > {code:scala} > df.writeStream > .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds")) > .partitionBy("partition1","partition2") > .format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get) > //-- > .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy") > .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, > param.expectedFileSizeInBytes) > .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, > HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES) > //-- > .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, > (param.expectedFileSizeInBytes / 100) * 80) > .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true") > .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, > param.runCompactionAfterNDeltaCommits.get) > //-- > .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id") > .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, > classOf[CustomKeyGenerator].getName) > 
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, > "partition1:SIMPLE,partition2:SIMPLE") > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, > hudiTablePrecombineKey) > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false") > .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > "partition1,partition2") > .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get) > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, > param.hiveNHudiTableName.get) > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > classOf[MultiPartKeysValueExtractor].getName) > .outputMode(OutputMode.Append()) > .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
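Following the comment on this issue, a manual fix might look like the excerpt below. This is a hypothetical sketch: the value shown assumes the table's partition fields really are `partition1,partition2` (as in the writer config above), and the actual value must match the CustomKeyGenerator's logic.

```properties
# .hoodie/hoodie.properties (hypothetical excerpt)
hoodie.table.partition.fields=partition1,partition2
```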
[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2287: - Priority: Major (was: Blocker) > Partition pruning not working on Hudi dataset > - > > Key: HUDI-2287 > URL: https://issues.apache.org/jira/browse/HUDI-2287 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance >Reporter: Rajkumar Gunasekaran >Assignee: Raymond Xu >Priority: Major > Fix For: 0.10.0 > > Original Estimate: 24h > Remaining Estimate: 24h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Parent: HUDI-2531 Issue Type: Sub-task (was: Improvement) > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Summary: Implement SparkDataFrameWriteClient with SimpleIndex (was: Support Dataset write w/o conversion to RDD) > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2621) Enhance DataFrameWriter with small file handling
[ https://issues.apache.org/jira/browse/HUDI-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2621: - Summary: Enhance DataFrameWriter with small file handling (was: Optimize DataFrameWriter on small file handling) > Enhance DataFrameWriter with small file handling > > > Key: HUDI-2621 > URL: https://issues.apache.org/jira/browse/HUDI-2621 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2287: - Priority: Blocker (was: Major) > Partition pruning not working on Hudi dataset > - > > Key: HUDI-2287 > URL: https://issues.apache.org/jira/browse/HUDI-2287 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance >Reporter: Rajkumar Gunasekaran >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > Original Estimate: 24h > Remaining Estimate: 24h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
[ https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2615: - Status: In Progress (was: Open) > Decouple HoodieRecordPayload with Hoodie table, table services, and index > - > > Key: HUDI-2615 > URL: https://issues.apache.org/jira/browse/HUDI-2615 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > > HoodieTable, HoodieIndex, and compaction, clustering services should be > independent of HoodieRecordPayload -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2287: - Status: Patch Available (was: In Progress) > Partition pruning not working on Hudi dataset > - > > Key: HUDI-2287 > URL: https://issues.apache.org/jira/browse/HUDI-2287 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance >Reporter: Rajkumar Gunasekaran >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > Original Estimate: 24h > Remaining Estimate: 24h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433587#comment-17433587 ] Raymond Xu commented on HUDI-1970: --
* 1B records (randomized values in the example trip model)
* 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / partition
* EMR 6.2 Spark 3.0.1-amzn-0
* S3, parquet compression snappy
* hudi: 109.8 GB = 22.4 MB parquet x 5000
* delta: 70.9 GB = 14.5 MB parquet x 5000
|SQL|Hudi 0.9.0 run 1 (s)|run 2 (s)|run 3 (s)|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086|
> Performance testing/certification of key SQL DMLs > - > > Key: HUDI-1970 > URL: https://issues.apache.org/jira/browse/HUDI-1970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance, Spark Integration >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
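The dataset-size figures in the comment above can be sanity-checked with a quick arithmetic sketch (not part of the original comment; it assumes binary units, i.e. 1 GB = 1024 MB):

```python
# Total file count: 100 partitions x 50 parquet files per partition.
files = 100 * 50  # 5000, matching the "x 5000" in the comment

# Per-file sizes from the comment: 22.4 MB (hudi), 14.5 MB (delta).
hudi_gb = 22.4 * files / 1024   # ~109.4 GB, close to the reported 109.8 GB
delta_gb = 14.5 * files / 1024  # ~70.8 GB, close to the reported 70.9 GB

print(files, round(hudi_gb, 1), round(delta_gb, 1))
```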
[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1970: - Status: In Progress (was: Open) > Performance testing/certification of key SQL DMLs > - > > Key: HUDI-1970 > URL: https://issues.apache.org/jira/browse/HUDI-1970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Performance, Spark Integration >Reporter: Vinoth Chandar >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1430: - Story Points: 3 (was: 2) > Implement SparkDataFrameWriteClient with SimpleIndex > > > Key: HUDI-1430 > URL: https://issues.apache.org/jira/browse/HUDI-1430 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)