[jira] [Updated] (HUDI-5602) Troubleshoot METADATA_ONLY bootstrapped table not being able to read back partition path
[ https://issues.apache.org/jira/browse/HUDI-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5602:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Troubleshoot METADATA_ONLY bootstrapped table not being able to read back partition path
>
>                 Key: HUDI-5602
>                 URL: https://issues.apache.org/jira/browse/HUDI-5602
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.12.2
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.14.0
>
> In https://github.com/apache/hudi/pull/7461, after enabling matching of the whole payload rather than just record counts, it was discovered that Hudi is not able to read back the partition path after running a METADATA_ONLY bootstrap, leading to a test failure (annotated with a TODO and this Jira in the test suite).
[jira] [Updated] (HUDI-5608) Support decimals w/ precision > 30 in Column Stats
[ https://issues.apache.org/jira/browse/HUDI-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5608:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Support decimals w/ precision > 30 in Column Stats
>
>                 Key: HUDI-5608
>                 URL: https://issues.apache.org/jira/browse/HUDI-5608
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.2
>            Reporter: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.14.0
>
> As reported in https://github.com/apache/hudi/issues/7732:
>
> Currently we've capped the precision of supported decimals at 30, assuming this is reasonably high to cover 99% of use cases, but it seems there is still demand for even larger decimals.
> The challenge, however, is to balance the need to support longer decimals against the storage space we have to provision for each one of them.
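For context on the storage tradeoff mentioned above, here is a minimal standalone sketch (not Hudi code) of the usual fixed-width decimal sizing math: a decimal of precision p stored as a fixed-length two's-complement byte array needs roughly ceil((p * log2(10) + 1) / 8) bytes, so widening the precision cap directly widens every stored stat.

{code:java}
// Standalone illustration (not from the Hudi codebase) of why wider
// decimals cost more storage per column-stats entry.
public class DecimalWidth {
  static int bytesRequired(int precision) {
    // bits to represent 10^precision - 1, plus one sign bit
    int bits = (int) Math.ceil(precision * (Math.log(10) / Math.log(2))) + 1;
    return (int) Math.ceil(bits / 8.0);
  }

  public static void main(String[] args) {
    System.out.println(bytesRequired(30)); // 13 bytes
    System.out.println(bytesRequired(38)); // 16 bytes
  }
}
{code}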
[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer
[ https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5575:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Support any record key generation along w/ any partition path generation for row writer
>
>                 Key: HUDI-5575
>                 URL: https://issues.apache.org/jira/browse/HUDI-5575
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Lokesh Jain
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> HUDI-5535 adds support for any record key generation along with any partition path generation. It also separates record key generation and partition path generation into separate interfaces.
> This Jira aims to add similar support for the row-writer path in Spark.
> cc [~shivnarayan]
[jira] [Updated] (HUDI-5574) Support auto record key generation with Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5574:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Support auto record key generation with Spark SQL
>
>                 Key: HUDI-5574
>                 URL: https://issues.apache.org/jira/browse/HUDI-5574
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Lokesh Jain
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: release-0.14.0-blocker
>             Fix For: 0.14.0
>
> HUDI-2681 adds support for auto record key generation with Spark dataframes. This Jira aims to add support for the same with Spark SQL.
> One of the changes required here, as pointed out by [~kazdy], is that SQL_INSERT_MODE needs to be handled: if SQL_INSERT_MODE is set to strict, the insert should fail.
> cc [~shivnarayan]
> Essentially, based on the patch https://github.com/apache/hudi/pull/7681, we want to ensure Spark SQL writes also support auto generation of record keys.
[jira] [Updated] (HUDI-5588) Fix Metadata table validator to deduce valid partitions when first commit where partition was added is failed
[ https://issues.apache.org/jira/browse/HUDI-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5588:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Fix Metadata table validator to deduce valid partitions when first commit where partition was added is failed
>
>                 Key: HUDI-5588
>                 URL: https://issues.apache.org/jira/browse/HUDI-5588
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: tests-ci
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.14.0
>
> Metadata validation sometimes fails due to a test code issue.
> FS-based listing shows 0 partitions, while MDT listing shows all 100 partitions. It's an issue with the validator code.
>
> Actual timeline:
> {code}
> ls -ltr tbl1/hoodie_table/.hoodie/
> total 720
> drwxr-xr-x 2 nsb staff     64 Jan 17 18:45 archived
> drwxr-xr-x 4 nsb staff    128 Jan 17 18:45 metadata
> -rw-r--r-- 1 nsb staff    808 Jan 17 18:45 hoodie.properties
> -rw-r--r-- 1 nsb staff   1230 Jan 17 18:45 20230117214546000.rollback.requested
> -rw-r--r-- 1 nsb staff      0 Jan 17 18:45 20230117214546000.rollback.inflight
> -rw-r--r-- 1 nsb staff   1414 Jan 17 18:46 20230117214546000.rollback
> -rw-r--r-- 1 nsb staff   1230 Jan 17 18:47 20230117214701512.rollback.requested
> -rw-r--r-- 1 nsb staff      0 Jan 17 18:47 20230117214701512.rollback.inflight
> -rw-r--r-- 1 nsb staff   1414 Jan 17 18:47 20230117214701512.rollback
> -rw-r--r-- 1 nsb staff  15492 Jan 17 18:48 20230117214831503.rollback.requested
> -rw-r--r-- 1 nsb staff      0 Jan 17 18:48 20230117214831503.rollback.inflight
> -rw-r--r-- 1 nsb staff      0 Jan 17 18:48 20230117214848714.deltacommit.requested
> -rw-r--r-- 1 nsb staff  16359 Jan 17 18:48 20230117214831503.rollback
> -rw-r--r-- 1 nsb staff  69698 Jan 17 18:49 20230117214848714.deltacommit.inflight
> -rw-r--r-- 1 nsb staff      0 Jan 17 18:50 20230117215006714.deltacommit.requested
> -rw-r--r-- 1 nsb staff  94423 Jan 17 18:50 20230117214848714.deltacommit
> -rw-r--r-- 1 nsb staff 142198 Jan 17 18:50 20230117215006714.deltacommit.inflight
> {code}
>
> So at least one commit succeeded: 20230117214848714.deltacommit.
> But the validator code checks the creation time of each partition and considers a partition valid only if that particular commit succeeded:
> {code:java}
> List<String> allPartitionPathsFromFS = FSUtils.getAllPartitionPaths(engineContext, basePath, false, cfg.assumeDatePartitioning);
> HoodieTimeline completedTimeline = metaClient.getActiveTimeline().filterCompletedInstants();
> // ignore partitions created by uncommitted ingestion.
> allPartitionPathsFromFS = allPartitionPathsFromFS.stream().parallel().filter(part -> {
>   HoodiePartitionMetadata hoodiePartitionMetadata =
>       new HoodiePartitionMetadata(metaClient.getFs(), FSUtils.getPartitionPath(basePath, part));
>   Option<String> instantOption = hoodiePartitionMetadata.readPartitionCreatedCommitTime();
>   if (instantOption.isPresent()) {
>     String instantTime = instantOption.get();
>     return completedTimeline.containsOrBeforeTimelineStarts(instantTime);
>   } else {
>     return false;
>   }
> }).collect(Collectors.toList());
> {code}
>
> We need to fix this.
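One possible direction for the fix, sketched below under stated assumptions: treat a partition as valid not only when its creating commit completed, but also when any later completed commit wrote to it. `hasCompletedCommitTouchingPartition` is a hypothetical helper for illustration, not an existing Hudi API.

{code:java}
// Hypothetical sketch only: relax the validity check so a failed
// partition-creating commit does not hide partitions that later
// completed commits wrote to.
allPartitionPathsFromFS = allPartitionPathsFromFS.stream().parallel().filter(part -> {
  HoodiePartitionMetadata partitionMetadata =
      new HoodiePartitionMetadata(metaClient.getFs(), FSUtils.getPartitionPath(basePath, part));
  Option<String> instantOption = partitionMetadata.readPartitionCreatedCommitTime();
  if (!instantOption.isPresent()) {
    return false;
  }
  String creationInstant = instantOption.get();
  // old check: valid only if the creating commit itself completed
  return completedTimeline.containsOrBeforeTimelineStarts(creationInstant)
      // new check (hypothetical helper): also valid if any completed commit
      // at or after the creation instant touched this partition
      || hasCompletedCommitTouchingPartition(completedTimeline, part, creationInstant);
}).collect(Collectors.toList());
{code}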
[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled
[ https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5444:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> FileNotFound issue w/ metadata enabled
>
>                 Key: HUDI-5444
>                 URL: https://issues.apache.org/jira/browse/HUDI-5444
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.14.0
>
> Stacktrace:
> {code:java}
> Caused by: java.io.FileNotFoundException: File not found: gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
>   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
> {code}
>
> Instants involved:
> 20221208133227028    (RB_C10)
> 20221208133227028001 (MDT compaction)
> 20221208132316380    (C10)
> 20221208133647230
>
> DT timeline (rollbacks):
> 8  | 20221202234515099 | rollback | COMPLETED | Rolls back 2022120413756     | 12-02 15:45:18 | 12-02 15:45:18 | 12-02 15:45:33
> 9  | 20221208133227028 | rollback | COMPLETED | Rolls back 20221208132316380 | 12-08 05:32:33 | 12-08 05:32:33 | 12-08 05:32:44
> 10 | 20221208133647230 | rollback | COMPLETED | Rolls back 20221208133222583 | 12-08 05:36:47 | 12-08 05:36:48 | 12-08 05:36:57
>
> MDT timeline:
> {code}
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:32 20221208133227028.deltacommit.requested
> -rw-r--r--@ 1 nsb staff  548 Dec 8 05:32 20221208133227028.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 6042 Dec 8 05:32 20221208133227028.deltacommit
> -rw-r--r--@ 1 nsb staff 1938 Dec 8 05:34 20221208133227028001.compaction.requested
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:34 20221208133227028001.compaction.inflight
> -rw-r--r--@ 1 nsb staff 7556 Dec 8 05:34 20221208133227028001.commit
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:34 20221208132316380.deltacommit.requested
> -rw-r--r--@ 1 nsb staff 3049 Dec 8 05:34 20221208132316380.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 8207 Dec 8 05:35 20221208132316380.deltacommit
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:36 20221208133647230.deltacommit.requested
> -rw-r--r--@ 1 nsb staff  548 Dec 8 05:36 20221208133647230.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 6042 Dec 8 05:36 20221208133647230.deltacommit
> {code}
[jira] [Updated] (HUDI-5507) SparkSQL can not read the latest change data without execute "refresh table xxx"
[ https://issues.apache.org/jira/browse/HUDI-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5507:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> SparkSQL can not read the latest change data without execute "refresh table xxx"
>
>                 Key: HUDI-5507
>                 URL: https://issues.apache.org/jira/browse/HUDI-5507
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>    Affects Versions: 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.12.2
>            Reporter: Danny Chen
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table
[ https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5557:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Wrong candidate files found in metadata table
>
>                 Key: HUDI-5557
>                 URL: https://issues.apache.org/jira/browse/HUDI-5557
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata, spark-sql
>    Affects Versions: 0.12.2
>            Reporter: ruofan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.3, 0.14.0
>
> Suppose the Hudi table has five fields, but only two fields are indexed. When part of the filter condition in SQL comes from indexed fields and the other part comes from non-indexed fields, the candidate files queried from the metadata table are wrong.
> For example, given the following Hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32)
> {code}
> table properties:
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true
> {code}
> and SQL:
> {code:java}
> select * from hudi_table where name='tom' and age=18;
> {code}
> With hoodie.enable.data.skipping=false the data can be found, but with hoodie.enable.data.skipping=true the expected data is not returned.
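A plausible shape of the fix, as a hedged sketch (names like `Expression`, `referencedColumns`, and `lookupColumnStatsIndex` are illustrative placeholders, not Hudi APIs): only conjuncts that reference exclusively indexed columns may shrink the candidate file set; predicates on non-indexed columns must be re-evaluated on the rows read back, never used for pruning.

{code:java}
// Illustrative only: split a conjunctive filter before the column-stats lookup.
List<Expression> prunable = new ArrayList<>();
for (Expression predicate : conjuncts) {
  // name = 'tom' references only indexed columns -> safe to prune with
  if (indexedColumns.containsAll(referencedColumns(predicate))) {
    prunable.add(predicate);
  }
  // age = 18 references a non-indexed column -> must NOT shrink the
  // candidate set; it is applied later, on the rows actually read
}
candidateFiles = lookupColumnStatsIndex(prunable);
{code}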
[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit
[ https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5463:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta commit
>
>                 Key: HUDI-5463
>                 URL: https://issues.apache.org/jira/browse/HUDI-5463
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.14.0
>
> As of now, any rollback in the data table (DT) is applied as another delta commit (DC) in the metadata table (MDT). This may not scale for the record level index in MDT, since we would have to add thousands of delete records and finally resolve all valid and invalid records. So it's better to roll back the commit in MDT as well, instead of applying a DC.
>
> Impact:
> The record level index is unusable without this change. While fixing other rollback-related tickets, consider this as a possible option if it simplifies other fixes.
[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing
[ https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5442:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Fix HiveHoodieTableFileIndex to use lazy listing
>
>                 Key: HUDI-5442
>                 URL: https://issues.apache.org/jira/browse/HUDI-5442
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core, trino-presto
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Critical
>             Fix For: 0.14.0
>
> Currently, HiveHoodieTableFileIndex hard-codes shouldListLazily to false, using eager listing only. This leads to scanning all table partitions in the file index, regardless of the queryPaths provided (for the Trino Hive connector, only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
>                                 HoodieTableMetaClient metaClient,
>                                 TypedProperties configProperties,
>                                 HoodieTableQueryType queryType,
>                                 List<Path> queryPaths,
>                                 Option<String> specifiedQueryInstant,
>                                 boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>       metaClient,
>       configProperties,
>       queryType,
>       queryPaths,
>       specifiedQueryInstant,
>       shouldIncludePendingCommits,
>       true,
>       new NoopCache(),
>       false);
> }
> {code}
> After flipping it to true for testing, the following exception is thrown:
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
>   at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>   at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>   at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>   at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
>   at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>   at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>   at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>   at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>   at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>   at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>   at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>   at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>   at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>   at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>   at org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>   at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>   at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>   at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>   at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>   at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>   at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>   at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEn
> {code}
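A sketch of one way the fix could look (an assumption, not the committed change): thread the listing mode through the constructor instead of hard-coding it, so callers such as the Trino Hive connector can opt into lazy listing. Note that the partition-path parsing failure shown above would still need to be addressed for lazy listing to work.

{code:java}
public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
                                HoodieTableMetaClient metaClient,
                                TypedProperties configProperties,
                                HoodieTableQueryType queryType,
                                List<Path> queryPaths,
                                Option<String> specifiedQueryInstant,
                                boolean shouldIncludePendingCommits,
                                boolean shouldListLazily) {   // new parameter
  super(engineContext,
      metaClient,
      configProperties,
      queryType,
      queryPaths,
      specifiedQueryInstant,
      shouldIncludePendingCommits,
      true,
      new NoopCache(),
      shouldListLazily);   // previously hard-coded to false
}
{code}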
[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly
[ https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5520:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Fail MDT when list of log files grows unboundedly
>
>                 Key: HUDI-5520
>                 URL: https://issues.apache.org/jira/browse/HUDI-5520
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Assignee: Jonathan Vexler
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-5436) Auto repair tool for MDT out of sync
[ https://issues.apache.org/jira/browse/HUDI-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5436:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Auto repair tool for MDT out of sync
>
>                 Key: HUDI-5436
>                 URL: https://issues.apache.org/jira/browse/HUDI-5436
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.14.0
>
> Can we write a spark-submit job to repair any out-of-sync issues with MDT? For example, if MDT validation fails for a given table, we don't have a good way to fix the MDT.
> So we should develop a spark-submit job that tries to deduce from which commit the out-of-sync happened and fixes just the delta.
>
> The idea here is:
> Run the validation job for the latest files at every commit, starting from the latest, in reverse chronological order. At some point validation will succeed; let's call that commit N.
> We can add a savepoint to MDT at commit N and restore the table to commit N.
> Then we can take any new commits after commit N from the data table and apply them one by one to MDT.
>
> Once complete, we can run the validation tool again to ensure the table is in good shape.
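The proposed flow, as rough pseudocode; every helper name here (`dataTableCommits`, `validateAt`, `savepoint`, `restoreTo`, `commitsAfter`, `applyToMetadataTable`, `runValidation`) is a placeholder for illustration, not an existing Hudi API.

{code:java}
// Placeholder pseudocode for the repair spark-submit job described above.
List<HoodieInstant> commits = dataTableCommits();      // chronological order
String commitN = null;
for (int i = commits.size() - 1; i >= 0; i--) {        // reverse chronological
  if (validateAt(commits.get(i).getTimestamp())) {     // MDT matches FS listing?
    commitN = commits.get(i).getTimestamp();           // first commit that validates
    break;
  }
}
savepoint(metadataTable, commitN);                     // savepoint MDT at commit N
restoreTo(metadataTable, commitN);                     // restore MDT to commit N
for (HoodieInstant c : commitsAfter(commitN)) {        // replay the delta
  applyToMetadataTable(c);                             // one commit at a time
}
runValidation();                                       // final sanity check
{code}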
[jira] [Updated] (HUDI-5374) Use KeyGeneratorFactory class for instantiating a KeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5374:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Use KeyGeneratorFactory class for instantiating a KeyGenerator
>
>                 Key: HUDI-5374
>                 URL: https://issues.apache.org/jira/browse/HUDI-5374
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Lokesh Jain
>            Assignee: Lokesh Jain
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Currently the configs hoodie.datasource.write.keygenerator.class and hoodie.datasource.write.keygenerator.type are used in multiple areas to create a key generator. The idea is to reuse the *KeyGeneratorFactory classes for instantiating KeyGenerators.
> This Jira adds a KeyGeneratorFactory base class, which HoodieSparkKeyGeneratorFactory and HoodieAvroKeyGeneratorFactory extend. These classes are then used throughout the code for creating KeyGenerators.
> Based on GitHub issue: https://github.com/apache/hudi/issues/7291
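For illustration, a call site would then look roughly like this; a sketch only, since the exact factory method shape may differ from what finally landed.

{code:java}
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.keygen.KeyGenerator;
import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory;

TypedProperties props = new TypedProperties();
props.setProperty("hoodie.datasource.write.recordkey.field", "id");
props.setProperty("hoodie.datasource.write.partitionpath.field", "dt");
// The factory resolves hoodie.datasource.write.keygenerator.class /
// hoodie.datasource.write.keygenerator.type internally, so call sites
// no longer duplicate that reflection logic.
KeyGenerator keyGenerator = HoodieSparkKeyGeneratorFactory.createKeyGenerator(props);
{code}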
[jira] [Updated] (HUDI-5271) Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
[ https://issues.apache.org/jira/browse/HUDI-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5271:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
>
>                 Key: HUDI-5271
>                 URL: https://issues.apache.org/jira/browse/HUDI-5271
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>            Reporter: Teng Huo
>            Assignee: Teng Huo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Exception detail in https://github.com/apache/hudi/issues/7284
[jira] [Updated] (HUDI-5364) Make sure Hudi's Column Stats are wired into Spark's relation stats
[ https://issues.apache.org/jira/browse/HUDI-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5364:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Make sure Hudi's Column Stats are wired into Spark's relation stats
>
>                 Key: HUDI-5364
>                 URL: https://issues.apache.org/jira/browse/HUDI-5364
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark, spark-sql
>    Affects Versions: 0.12.1
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.14.0
>
> Currently, we're leveraging CSI exclusively to better prune the target files.
> Additionally, we should wire stats from CSI into Spark's `CatalogStatistics`, which in turn will be leveraged by Spark's optimization rules for better planning.
[jira] [Updated] (HUDI-5385) Make behavior of keeping File Writers open configurable
[ https://issues.apache.org/jira/browse/HUDI-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5385:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Make behavior of keeping File Writers open configurable
>
>                 Key: HUDI-5385
>                 URL: https://issues.apache.org/jira/browse/HUDI-5385
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.1
>            Reporter: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.14.0
>
> Currently, when writing in Spark we keep the file writers for individual partitions open as long as we're processing the batch, which entails that all of the data written out is kept in memory (at least the last row-group, in the case of Parquet writers) until the batch is fully processed and all of the writers are closed.
> While this allows us to better control how many files are created in every partition (we keep the writer open, so we don't need to create a new file when a new record comes in), it brings the penalty of keeping all of the data in memory, potentially leading to OOMs, longer GC cycles, etc.
[jira] [Updated] (HUDI-5322) Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's schema
[ https://issues.apache.org/jira/browse/HUDI-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5322:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's schema
>
>                 Key: HUDI-5322
>                 URL: https://issues.apache.org/jira/browse/HUDI-5322
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>            Reporter: Alexey Kudinkin
>            Assignee: Jonathan Vexler
>            Priority: Critical
>             Fix For: 0.14.0
>
> Row-writing bulk-insert has to rewrite the incoming dataset into the finalized writer's schema. Instead, it currently uses the incoming dataset as is, deviating in semantics from the non-row-writing flow (and from other operations).
[jira] [Updated] (HUDI-5405) Avoid using Projections in generic Merge Into DMLs
[ https://issues.apache.org/jira/browse/HUDI-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5405:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Avoid using Projections in generic Merge Into DMLs
>
>                 Key: HUDI-5405
>                 URL: https://issues.apache.org/jira/browse/HUDI-5405
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>    Affects Versions: 0.12.1
>            Reporter: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.14.0
>
> Currently, `MergeIntoHoodieTableCommand` squarely relies on the semantics implemented by `ExpressionPayload` to be able to insert/update records.
> While this is necessary since MIT semantics enable users to do sophisticated and fine-grained updates (for example, partial updates), it is not necessary in the most generic case:
>
> {code:java}
> MERGE INTO target
> USING ... source
> ON target.id = source.id
> WHEN MATCHED THEN UPDATE *
> WHEN NOT MATCHED THEN INSERT *
> {code}
> This is essentially just a SQL way of implementing an upsert: if there are matching records in the table we update them, otherwise we insert.
> In this case there's actually no need to use ExpressionPayload at all; we can simply use the normal Hudi upserting flow to handle it (avoiding all of the ExpressionPayload overhead).
[jira] [Updated] (HUDI-5361) Propagate Hudi properties set in Spark's SQLConf to Hudi
[ https://issues.apache.org/jira/browse/HUDI-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5361:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Propagate Hudi properties set in Spark's SQLConf to Hudi
>
>                 Key: HUDI-5361
>                 URL: https://issues.apache.org/jira/browse/HUDI-5361
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: Alexey Kudinkin
>            Assignee: Jonathan Vexler
>            Priority: Critical
>             Fix For: 0.14.0
>
> Currently, the only property we propagate from Spark's SQLConf is hoodie.metadata.enable.
> Instead, we should pull all of the Hudi-related configs from SQLConf and pass them to Hudi.
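A minimal sketch of the intended behavior, assuming a simple prefix filter over the session conf (variable names are illustrative, not the actual patch):

{code:java}
// Collect every "hoodie."-prefixed entry from Spark's session conf,
// rather than propagating only hoodie.metadata.enable.
java.util.Map<String, String> sessionConf =
    scala.collection.JavaConverters.mapAsJavaMap(spark.conf().getAll());
TypedProperties hudiOverrides = new TypedProperties();
sessionConf.forEach((key, value) -> {
  if (key.startsWith("hoodie.")) {
    hudiOverrides.setProperty(key, value);
  }
});
// hudiOverrides would then be merged into the options passed to Hudi.
{code}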
[jira] [Updated] (HUDI-5438) Benchmark calls w/ metadata enabled and ensure no calls to direct FS
[ https://issues.apache.org/jira/browse/HUDI-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5438:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Benchmark calls w/ metadata enabled and ensure no calls to direct FS
>
>                 Key: HUDI-5438
>                 URL: https://issues.apache.org/jira/browse/HUDI-5438
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.14.0
>
> We need to benchmark calls to S3 (via S3 access logs) and ensure that when metadata is enabled, we don't make any direct calls to the FS.
[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata
[ https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5352:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
>
>                 Key: HUDI-5352
>                 URL: https://issues.apache.org/jira/browse/HUDI-5352
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: Alexey Kudinkin
>            Assignee: Raymond Xu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests because Jackson is not able to serialize LocalDate as is, requiring the additional JSR310 dependency.
[jira] [Updated] (HUDI-5319) NPE in Bloom Filter Index
[ https://issues.apache.org/jira/browse/HUDI-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5319:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> NPE in Bloom Filter Index
>
>                 Key: HUDI-5319
>                 URL: https://issues.apache.org/jira/browse/HUDI-5319
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.12.1
>            Reporter: Alexey Kudinkin
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
>
> {code:java}
> /12/02 11:05:49 WARN TaskSetManager: Lost task 3.0 in stage 1098.0 (TID 1300185) (ip-172-31-23-246.us-east-2.compute.internal executor 10): java.lang.RuntimeException: org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter index.
>   at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>   at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183)
>   at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter index.
>   at org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>   at org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>   at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   ... 16 more
> Caused by: java.lang.NullPointerException
>   at org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:87)
>   at org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>   ... 18 more
> {code}
[jira] [Updated] (HUDI-4944) The encoded slash (%2F) in partition path is not properly decoded during Spark read
[ https://issues.apache.org/jira/browse/HUDI-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4944:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> The encoded slash (%2F) in partition path is not properly decoded during Spark read
>
>                 Key: HUDI-4944
>                 URL: https://issues.apache.org/jira/browse/HUDI-4944
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap
>            Reporter: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>         Attachments: Untitled
>
> When the source partitioned parquet table of the bootstrap operation has the encoded slash (%2F) in the partition path, e.g. "partition_path=2015%2F03%2F17", after the metadata-only bootstrap (with the bootstrap index storing the data file path containing the partition path with the encoded slash), the target bootstrapped Hudi table cannot be read due to a FileNotFound exception. The root cause is that the encoding of the slash is lost when creating the new Path instance with the URI (see below: "partition_path=2015/03/17" instead of "partition_path=2015%2F03%2F17").
> {code:java}
> Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:62738/user/ethan/test_dataset_bootstrapped/partition_path=2015/03/17/e0fa3466-d3bc-43f7-b586-2f95d8745095_3-161-675_01.parquet
>   at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1528)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1521)
>   at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
>   at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>   at org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(Spark24HoodieParquetFileFormat.scala:131)
>   at org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(Spark24HoodieParquetFileFormat.scala:130)
>   at org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:134)
>   at org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:111)
>   at org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:71)
>   at org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:70)
>   at org.apache.hudi.HoodieBootstrapRDD.compute(HoodieBootstrapRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> {code}
> The path conversion that causes the problem is in the code below; "new URI(file.filePath)" decodes the "%2F" and converts the slash.
> Spark24HoodieParquetFileFormat (same for Spark32PlusHoodieParquetFileFormat):
> {code:java}
> val fileSplit =
>   new FileSplit(new Path(new URI(file.filePath)), file.start, file.length, Array.empty)
> {code}
> This fails the tests below, and we need to use a partition path without slashes in the value for now:
> TestHoodieDeltaStreamer#testBulkInsertsAndUpsertsWithBootstrap
> ITTestHoodieDemo#testParquetDemo
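The root cause can be reproduced standalone. A small demo (the path is made up) showing that routing a string through java.net.URI decodes %2F into a literal slash, changing the perceived partition depth, while constructing the Path from the raw string keeps the encoding:

{code:java}
import java.net.URI;
import org.apache.hadoop.fs.Path;

public class EncodedSlashDemo {
  public static void main(String[] args) throws Exception {
    String filePath = "hdfs://localhost/tbl/partition_path=2015%2F03%2F17/f1.parquet";
    // URI decodes %2F -> "/", yielding .../partition_path=2015/03/17/f1.parquet
    System.out.println(new Path(new URI(filePath)));
    // Constructing the Path from the raw string keeps %2F intact
    System.out.println(new Path(filePath));
  }
}
{code}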
[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers
[ https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4937:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers
>
>                 Key: HUDI-4937
>                 URL: https://issues.apache.org/jira/browse/HUDI-4937
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core, writer-core
>    Affects Versions: 0.12.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set up not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This is proving to be wasteful on a number of occasions already, including (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373
[jira] [Updated] (HUDI-4738) [MOR] Bloom Index missing new records inserted into Log files
[ https://issues.apache.org/jira/browse/HUDI-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4738:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> [MOR] Bloom Index missing new records inserted into Log files
>
>                 Key: HUDI-4738
>                 URL: https://issues.apache.org/jira/browse/HUDI-4738
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.14.0
>
> Currently, the Bloom Index is implemented under the assumption that a file group (once written) has a fixed set of records that cannot be expanded (this is encoded through the assumption that at least one version of every record within the file group is stored within its base file).
> This is relied upon when we tag incoming records with the locations of the file groups they could potentially belong to (in case such records are updates), by fetching the Bloom Index info from either a) the base file or b) the record in the MT Bloom Index associated with the particular file-group id.
>
> However, this assumption is not always true, since it's possible for _new_ records to be inserted into the log files, which means that the record key-set of a single file group can expand. This could lead to some records that were previously written to log files being duplicated.
>
> We need to reconcile these 2 aspects and do either of:
> 1. Disallow expansion of the file group's record set (by not allowing inserts into log files)
> 2. Fix the Bloom Index implementation to also check log files during tagging.
[jira] [Updated] (HUDI-5080) UnpersistRdds unpersist all rdds in the spark context
[ https://issues.apache.org/jira/browse/HUDI-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5080:
    Fix Version/s:     (was: 0.13.1)

> UnpersistRdds unpersist all rdds in the spark context
>
>                 Key: HUDI-5080
>                 URL: https://issues.apache.org/jira/browse/HUDI-5080
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Assignee: Raymond Xu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> In SparkRDDWriteClient, we have a method to clean up persisted RDDs to free up the space they occupy:
> https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java#L584
> The issue is that it cleans up all persisted RDDs in the given Spark context. This impacts async compaction or any other async table services running; and if there are multiple streams writing to different tables, the impact is huge.
>
> This also needs to be fixed in DeltaSync:
> https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L345
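A hedged sketch of the safer pattern (illustrative, not the committed fix): track the IDs of the RDDs this client persisted and unpersist only those, rather than walking every persisted RDD in the shared SparkContext.

{code:java}
// Sketch: scope unpersisting to RDDs this writer actually persisted.
private final Set<Integer> persistedRddIds = new HashSet<>();

void persistTracked(JavaRDD<?> rdd, StorageLevel level) {
  rdd.persist(level);
  persistedRddIds.add(rdd.id());
}

void unpersistTracked(JavaSparkContext jsc) {
  jsc.getPersistentRDDs().forEach((id, rdd) -> {
    if (persistedRddIds.contains(id)) {  // leave other writers' RDDs alone
      rdd.unpersist();
    }
  });
  persistedRddIds.clear();
}
{code}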
[jira] [Updated] (HUDI-4947) Missing .hoodie/hoodie.properties in Hudi table
[ https://issues.apache.org/jira/browse/HUDI-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4947:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Missing .hoodie/hoodie.properties in Hudi table
>
>                 Key: HUDI-4947
>                 URL: https://issues.apache.org/jira/browse/HUDI-4947
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
>
> At some point, the ingestion job reports that hoodie.properties is missing and neither hoodie.properties nor hoodie.properties.backup is present.
> Sample stacktrace:
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from s3://.../.hoodie/hoodie.properties
>   at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:254)
>   at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:125)
>   at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:78)
>   at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:668)
>   at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$getHoodieTableConfig$1(HoodieSparkSqlWriter.scala:756)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.hudi.HoodieSparkSqlWriter$.getHoodieTableConfig(HoodieSparkSqlWriter.scala:757)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:85)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:165)
> {code}
[jira] [Updated] (HUDI-4922) Presto query of bootstrapped data returns null
[ https://issues.apache.org/jira/browse/HUDI-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4922:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Presto query of bootstrapped data returns null
>
>                 Key: HUDI-4922
>                 URL: https://issues.apache.org/jira/browse/HUDI-4922
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Critical
>             Fix For: 0.14.0
>
> https://github.com/apache/hudi/issues/6532
[jira] [Updated] (HUDI-5092) Querying Hudi table throws NoSuchMethodError in Databricks runtime
[ https://issues.apache.org/jira/browse/HUDI-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5092:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Querying Hudi table throws NoSuchMethodError in Databricks runtime
>
>                 Key: HUDI-5092
>                 URL: https://issues.apache.org/jira/browse/HUDI-5092
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.0
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
>         Attachments: image (1).png, image.png
>
> Originally reported by the user:
> https://github.com/apache/hudi/issues/6137
>
> The crux of the issue is that Databricks's DBR runtime diverges from OSS Spark, and in this case the `FileStatusCache` API is very clearly divergent between the two.
> There are a few approaches we can take:
> 1. Avoid reliance on Spark's FileStatusCache implementation altogether and rely on our own.
> 2. Apply a more staggered approach where we first try to use Spark's FileStatusCache, and if it doesn't match the expected API, we fall back to our own impl.
>
> Approach #1 would actually mean that we're not sharing the cache implementation with Spark, which in turn would entail that in some cases we might be keeping 2 instances of the same cache. Approach #2 remediates that and allows us to fall back only in case the API is not compatible.
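A sketch of approach #2 under stated assumptions: `HoodieInMemoryFileStatusCache` is a hypothetical fallback class, and the Scala companion-object call is written as plain Java for brevity.

{code:java}
// Probe Spark's FileStatusCache; if the runtime (e.g. DBR) ships a divergent
// API, fall back to a private Hudi-owned cache (hypothetical class).
FileStatusCache cache;
try {
  cache = FileStatusCache.getOrCreate(spark);
} catch (NoSuchMethodError | NoClassDefFoundError e) {
  cache = new HoodieInMemoryFileStatusCache();  // fallback, Hudi-owned
}
{code}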
[jira] [Updated] (HUDI-5015) Cleaner does not work properly when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5015:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Cleaner does not work properly when metadata table is enabled
>
>                 Key: HUDI-5015
>                 URL: https://issues.apache.org/jira/browse/HUDI-5015
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>             Fix For: 0.14.0
>
> Please see https://github.com/apache/hudi/pull/6926 for more context.
[jira] [Updated] (HUDI-4958) Provide accurate numDeletes in commit metadata
[ https://issues.apache.org/jira/browse/HUDI-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4958:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Provide accurate numDeletes in commit metadata
>
>                 Key: HUDI-4958
>                 URL: https://issues.apache.org/jira/browse/HUDI-4958
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
>
> A simple computation of {{numInserts - numDeletes}} across all commits leads to a negative total record count, which should be impossible. We need to check whether the numbers of inserts and deletes are accurate when both inserts and deletes exist in the same input batch for upsert.
[jira] [Updated] (HUDI-5229) Add flink avro version entry in root pom
[ https://issues.apache.org/jira/browse/HUDI-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5229:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Add flink avro version entry in root pom
>
>                 Key: HUDI-5229
>                 URL: https://issues.apache.org/jira/browse/HUDI-5229
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>            Reporter: Danny Chen
>            Priority: Major
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-4777) Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue
[ https://issues.apache.org/jira/browse/HUDI-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4777:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue
>
>                 Key: HUDI-4777
>                 URL: https://issues.apache.org/jira/browse/HUDI-4777
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: JinxinTang
>            Assignee: JinxinTang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner
[ https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4921:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Fix last completed commit in CleanPlanner
>
>                 Key: HUDI-4921
>                 URL: https://issues.apache.org/jira/browse/HUDI-4921
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Recently we added the last completed commit as part of the clean commit metadata. Ideally the value should represent the last completed commit in the timeline before which there are no inflight commits, but we currently just take the last completed commit in the active timeline and set the value.
> This needs fixing.
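The intended semantics, sketched with HoodieTimeline calls; treat this as illustrative rather than the final fix.

{code:java}
HoodieTimeline timeline = metaClient.getActiveTimeline();
Option<HoodieInstant> firstInflight =
    timeline.filterInflightsAndRequested().firstInstant();

// What we want: the last COMPLETED instant that is earlier than every
// inflight instant -- not simply the last completed instant overall.
Option<HoodieInstant> lastCompletedBeforeInflight = firstInflight.isPresent()
    ? timeline.filterCompletedInstants()
        .findInstantsBefore(firstInflight.get().getTimestamp())
        .lastInstant()
    : timeline.filterCompletedInstants().lastInstant();
{code}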
[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap
[ https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4854:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Deltastreamer does not respect partition selector regex for metadata-only bootstrap
>
>                 Key: HUDI-4854
>                 URL: https://issues.apache.org/jira/browse/HUDI-4854
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-4852) Incremental sync not updating pending file groups under clustering
[ https://issues.apache.org/jira/browse/HUDI-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4852:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Incremental sync not updating pending file groups under clustering
>
>                 Key: HUDI-4852
>                 URL: https://issues.apache.org/jira/browse/HUDI-4852
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Surya Prasanna Yalla
>            Assignee: Surya Prasanna Yalla
>            Priority: Critical
>             Fix For: 0.14.0
>
> Pending file groups under clustering are not updated through incremental sync calls.
[jira] [Updated] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
[ https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4818:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
>
>                 Key: HUDI-4818
>                 URL: https://issues.apache.org/jira/browse/HUDI-4818
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Currently, using `CustomKeyGenerator` with the partition-path config {hoodie.datasource.write.partitionpath.field=ts:timestamp} fails with:
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to `LongType` for partition column `ts_ms`
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
> {code}
>
> This occurs because SparkHoodieTableFileIndex produces an incorrect partition schema at XXX, where it properly handles only `TimestampBasedKeyGenerator`s but not other key generators that may change the data type of the partition value relative to the source partition column (in this case `ts` is a long in the source table schema, but the produced partition value is a string).
[jira] [Updated] (HUDI-4632) Remove the force active property for flink1.14 profile
[ https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4632:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> Remove the force active property for flink1.14 profile
>
>                 Key: HUDI-4632
>                 URL: https://issues.apache.org/jira/browse/HUDI-4632
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>    Affects Versions: 0.11.1
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-4643) MergeInto syntax WHEN MATCHED is optional but must be set
[ https://issues.apache.org/jira/browse/HUDI-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4643:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> MergeInto syntax WHEN MATCHED is optional but must be set
>
>                 Key: HUDI-4643
>                 URL: https://issues.apache.org/jira/browse/HUDI-4643
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: 董可伦
>            Assignee: 董可伦
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> {code:java}
> spark.sql(
>   s"""
>      |create table $tableName (
>      |  id int,
>      |  name string,
>      |  price double,
>      |  ts long,
>      |  dt string
>      |) using hudi
>      | location '${tmp.getCanonicalPath}/$tableName'
>      | tblproperties (
>      |   primaryKey = 'id',
>      |   preCombineField = 'ts'
>      | )
>      """.stripMargin)
> // Insert data
> spark.sql(s"insert into $tableName select 1, 'a1', 1, 10, '2022-08-18'")
> spark.sql(
>   s"""
>      | merge into $tableName as t0
>      | using (
>      |   select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt union all
>      |   select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
>      | ) as s0
>      | on t0.id = s0.id
>      | when not matched then insert *
>      """.stripMargin
> )
> {code}
> {code:java}
> 11493 [Executor task launch worker for task 65] ERROR org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor - Error upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: java.lang.AssertionError: assertion failed: hoodie.payload.update.condition.assignments have not set
>   at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:335)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:246)
> {code}
>
> If hoodie.merge.allow.duplicate.on.inserts = true is set, the result has one more record than expected:
> [1,a1,1.0,10,2022-08-18], [1,a1,11.0,110,2022-08-19], [2,a2,10.0,100,2022-08-18]
[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table
[ https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-4704:
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> bulk insert overwrite table will delete the table and then recreate a table
>
>                 Key: HUDI-4704
>                 URL: https://issues.apache.org/jira/browse/HUDI-4704
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql, writer-core
>    Affects Versions: 0.12.0
>            Reporter: zouxxyy
>            Assignee: Raymond Xu
>            Priority: Major
>             Fix For: 0.14.0
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite deletes the table and then recreates it, so time travel cannot be performed.
>
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}
[jira] [Updated] (HUDI-4542) Flink streaming query fails with ClassNotFoundException
[ https://issues.apache.org/jira/browse/HUDI-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4542: Fix Version/s: 0.14.0 (was: 0.13.1) > Flink streaming query fails with ClassNotFoundException > --- > > Key: HUDI-4542 > URL: https://issues.apache.org/jira/browse/HUDI-4542 > Project: Apache Hudi > Issue Type: Bug > Components: flink-sql >Reporter: Ethan Guo >Priority: Critical > Fix For: 0.14.0 > > Attachments: Screen Shot 2022-08-04 at 17.17.42.png > > > Environment: EMR 6.7.0 Flink 1.14.2 > Reproducible steps: Build Hudi Flink bundle from master > {code:java} > mvn clean package -DskipTests -pl :hudi-flink1.14-bundle -am {code} > Copy to EMR master node /lib/flink/lib > Launch Flink SQL client: > {code:java} > cd /lib/flink && ./bin/yarn-session.sh --detached > ./bin/sql-client.sh {code} > Write a Hudi table with a few commits with metadata table enabled (no column > stats). Then, run the following for the streaming query > {code:java} > CREATE TABLE t2( > uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED, > name VARCHAR(10), > age INT, > ts TIMESTAMP(3), > `partition` VARCHAR(20) > ) > PARTITIONED BY (`partition`) > WITH ( > 'connector' = 'hudi', > 'path' = 's3a://', > 'table.type' = 'MERGE_ON_READ', > 'read.streaming.enabled' = 'true', -- this option enable the streaming > read > 'read.start-commit' = '20220803165232362', -- specifies the start commit > instant time > 'read.streaming.check-interval' = '4' -- specifies the check interval for > finding new source commits, default 60s. > ); {code} > {code:java} > select * from t2; {code} > {code:java} > Flink SQL> select * from t2; > 2022-08-05 00:12:43,635 INFO org.apache.hadoop.metrics2.impl.MetricsConfig > [] - Loaded properties from hadoop-metrics2.properties > 2022-08-05 00:12:43,650 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl [] - Scheduled > Metric snapshot period at 300 second(s). > 2022-08-05 00:12:43,650 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl [] - > s3a-file-system metrics system started > 2022-08-05 00:12:47,722 INFO org.apache.hadoop.fs.s3a.S3AInputStream > [] - Switching to Random IO seek policy > 2022-08-05 00:12:47,941 INFO org.apache.hadoop.yarn.client.RMProxy > [] - Connecting to ResourceManager at > ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:8032 > 2022-08-05 00:12:47,942 INFO org.apache.hadoop.yarn.client.AHSProxy > [] - Connecting to Application History server at > ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:10200 > 2022-08-05 00:12:47,942 INFO org.apache.flink.yarn.YarnClusterDescriptor > [] - No path for the flink jar passed. Using the location of > class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar > 2022-08-05 00:12:47,942 WARN org.apache.flink.yarn.YarnClusterDescriptor > [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR > environment variable is set.The Flink YARN Client needs one of these to be > set to properly load the Hadoop configuration for accessing YARN. > 2022-08-05 00:12:47,959 INFO org.apache.flink.yarn.YarnClusterDescriptor > [] - Found Web Interface > ip-172-31-3-92.us-east-2.compute.internal:39605 of application > 'application_1659656614768_0001'. > [ERROR] Could not execute SQL statement. 
Reason: > java.lang.ClassNotFoundException: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat{code} > {code:java} > 2022-08-04 17:12:59 > org.apache.flink.runtime.JobException: Recovery is suppressed by > NoRestartBackoffTimeStrategy > at > org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138) > at > org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209) > at > org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679) > at > org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79) > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444) > at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) > at > sun.reflect
[jira] [Updated] (HUDI-4573) Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode
[ https://issues.apache.org/jira/browse/HUDI-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4573: Fix Version/s: 0.14.0 (was: 0.13.1) > Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode > > > Key: HUDI-4573 > URL: https://issues.apache.org/jira/browse/HUDI-4573 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Ethan Guo >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4541) Flink job fails with column stats enabled in metadata table due to NotSerializableException
[ https://issues.apache.org/jira/browse/HUDI-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4541: Fix Version/s: 0.14.0 (was: 0.13.1) > Flink job fails with column stats enabled in metadata table due to > NotSerializableException > > > Key: HUDI-4541 > URL: https://issues.apache.org/jira/browse/HUDI-4541 > Project: Apache Hudi > Issue Type: Bug > Components: flink-sql >Reporter: Ethan Guo >Priority: Blocker > Fix For: 0.14.0 > > Attachments: Screen Shot 2022-08-04 at 17.10.05.png > > > Environment: EMR 6.7.0 Flink 1.14.2 > Reproducible steps: Build Hudi Flink bundle from master > {code:java} > mvn clean package -DskipTests -pl :hudi-flink1.14-bundle -am {code} > Copy to EMR master node /lib/flink/lib > Launch Flink SQL client: > {code:java} > cd /lib/flink && ./bin/yarn-session.sh --detached > ./bin/sql-client.sh {code} > Run the following from the Flink quick start guide with metadata table, > column stats, and data skipping enabled > {code:java} > CREATE TABLE t1( > uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED, > name VARCHAR(10), > age INT, > ts TIMESTAMP(3), > `partition` VARCHAR(20) > ) > PARTITIONED BY (`partition`) > WITH ( > 'connector' = 'hudi', > 'path' = 's3a://', > 'table.type' = 'MERGE_ON_READ', -- this creates a MERGE_ON_READ table, by > default is COPY_ON_WRITE > 'metadata.enabled' = 'true', -- enables multi-modal index and metadata table > 'hoodie.metadata.index.column.stats.enable' = 'true', -- enables column > stats in metadata table > 'read.data.skipping.enabled' = 'true' -- enables data skipping > ); > INSERT INTO t1 VALUES > ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'), > ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'), > ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'), > ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'), > ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'), > ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'), > ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'), > ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); {code} > !Screen Shot 2022-08-04 at 17.10.05.png|width=1130,height=463! 
> Exception: > {code:java} > 2022-08-04 17:04:41 > org.apache.flink.runtime.JobException: Recovery is suppressed by > NoRestartBackoffTimeStrategy > at > org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138) > at > org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209) > at > org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679) > at > org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79) > at > org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444) > at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > at scala.PartialFunction$
[jira] [Updated] (HUDI-4457) Make sure IT docker test return code non-zero when failed
[ https://issues.apache.org/jira/browse/HUDI-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4457: Fix Version/s: 0.14.0 (was: 0.13.1) > Make sure IT docker test return code non-zero when failed > - > > Key: HUDI-4457 > URL: https://issues.apache.org/jira/browse/HUDI-4457 > Project: Apache Hudi > Issue Type: Bug > Components: tests-ci >Reporter: Raymond Xu >Priority: Major > Fix For: 0.14.0 > > > There are IT test cases where the docker command runs and returns exit code 0, but the > test actually failed. This is misleading for troubleshooting. > TODO: > 1. verify the behavior > 2. fix it -- This message was sent by Atlassian Jira (v8.20.10#820010)
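For illustration only (the real IT scripts differ), the gist of the fix is to capture the docker command's exit status and propagate it, instead of letting the build step pass unconditionally; a sketch using scala.sys.process:

{code:scala}
import scala.sys.process._

// Hypothetical command; the actual suite invokes docker with its own arguments.
val cmd = Seq("docker", "exec", "adhoc-1", "/bin/bash", "-c", "./run-integ-tests.sh")

val exitCode = cmd.! // run the process and return its exit status

// Fail loudly so CI reports the run as failed rather than green.
if (exitCode != 0) {
  sys.error(s"IT docker run failed with exit code $exitCode")
}
{code}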
[jira] [Updated] (HUDI-4430) Incorrect type casting while reading HUDI table created with CustomKeyGenerator and unixtimestamp partitioning field
[ https://issues.apache.org/jira/browse/HUDI-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4430: Fix Version/s: 0.14.0 (was: 0.13.1) > Incorrect type casting while reading HUDI table created with > CustomKeyGenerator and unixtimestamp partitioning field > --- > > Key: HUDI-4430 > URL: https://issues.apache.org/jira/browse/HUDI-4430 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Affects Versions: 0.12.0 >Reporter: Volodymyr Burenin >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.0 > > > Hi, > I have discovered an issue that doesn't play nicely with custom key > generators, basically anything that is not TimestampBasedKeyGenerator or > TimestampBasedAvroKeyGenerator. > While trying to read a table that was created with these parameters (the > rest don't matter): > {code:java} > hoodie.datasource.write.recordkey.field=query_id,event_type > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator > hoodie.datasource.write.partitionpath.field=create_time_epoch_seconds:timestamp > hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd{code} > I get an error that looks like: > {code:java} > 22/07/20 20:32:48 DEBUG Spark32HoodieParquetFileFormat: Appending > StructType(StructField(create_time_epoch_seconds,LongType,true)) [2022/07/13] > 22/07/20 20:32:48 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot > be cast to java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195) > at > org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:66) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245) > {code} > Apparently the issue is in the _partitionSchemaFromProperties function in > hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala, > which checks the key generator class type to decide when to use a StructType of String. > For any class that is not one of the known Timestamp-based generators it simply reuses > whatever type the data schema declares, and then fails to retrieve the partition value. > I have a proposal here: give the user a way to force a > string type if needed, and add the ability to add a prefixed column that contains > the processed partition value. These could be done as two separate features. > This problem is critical for me, so I have to change the Hoodie source code on my > end temporarily to make it work. > Here is how I roughly changed the referenced function: > > {code:java} > /** > * Get the partition schema from the hoodie.properties. 
> */ > private lazy val _partitionSchemaFromProperties: StructType = { > val tableConfig = metaClient.getTableConfig > val partitionColumns = tableConfig.getPartitionFields > if (partitionColumns.isPresent) { > val partitionFields = partitionColumns.get().map(column => > StructField("_hoodie_"+column, StringType)) > StructType(partitionFields) > } else { > // If the partition columns have not been stored in hoodie.properties (the > table was > // created earlier), we treat it as a non-partitioned table. > logWarning("No partition columns available from hoodie.properties." + > " Partition pruning will not work") > new StructType() > } > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
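Building on the reporter's snippet, one way to make the string fallback opt-in rather than hard-coded is to gate it behind a flag; the helper and flag below are hypothetical, sketched only to show the proposed shape:

{code:scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: when forceStringPartitions is set, type partition
// columns as String (matching the on-disk path segments); otherwise keep
// today's behavior of reusing the type from the data schema.
def partitionSchema(partitionFields: Array[String],
                    dataSchema: StructType,
                    forceStringPartitions: Boolean): StructType = {
  val fields = partitionFields.map { column =>
    if (forceStringPartitions) StructField(column, StringType)
    else dataSchema(column) // StructType.apply looks the field up by name
  }
  StructType(fields)
}
{code}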
[jira] [Updated] (HUDI-4539) Make Hudi's CLI API consistent
[ https://issues.apache.org/jira/browse/HUDI-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4539: Fix Version/s: 0.14.0 (was: 0.13.1) > Make Hudi's CLI API consistent > -- > > Key: HUDI-4539 > URL: https://issues.apache.org/jira/browse/HUDI-4539 > Project: Apache Hudi > Issue Type: Bug > Components: cli >Reporter: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.0 > > > Currently the API provided by the CLI is inconsistent: > # Some of the commands (to display metadata, for example) are applicable to some > commits/actions but not others > # The same actions should be applicable to both the active and archived timelines > (from the CLI standpoint there should be essentially no difference) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4330) NPE when trying to upsert into a dataset with no Meta Fields
[ https://issues.apache.org/jira/browse/HUDI-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4330: Fix Version/s: 0.14.0 (was: 0.13.1) > NPE when trying to upsert into a dataset with no Meta Fields > > > Key: HUDI-4330 > URL: https://issues.apache.org/jira/browse/HUDI-4330 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Raymond Xu >Priority: Critical > Fix For: 0.14.0 > > > When trying to upsert into a dataset with Meta Fields being disabled, you > will encounter obscure NPE like below: > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 25 in stage 20.0 failed 4 times, most recent failure: Lost task 25.3 in > stage 20.0 (TID 4110) (ip-172-31-20-53.us-west-2.compute.internal executor > 7): java.lang.RuntimeException: > org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter > index. > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121) > at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking > bloom filter index. > at > org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110) > at > org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60) > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119) > ... 16 more > Caused by: java.lang.NullPointerException > at > org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:88) > at > org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92) > ... 18 more {code} > Instead, we could be more explicit as to why this could have happened > (meta-fields disabled -> no bloom filter created -> unable to do upserts) -- This message was sent by Atlassian Jira (v8.20.10#820010)
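A minimal repro sketch of the failure mode (paths and table name are illustrative; df is assumed to be a DataFrame with uuid, ts, and partition columns): inserting with meta fields disabled and then upserting trips the bloom-index lookup, because no bloom filters were ever written.

{code:scala}
import org.apache.spark.sql.SaveMode

val basePath = "/tmp/no_meta_fields_tbl"
val hudiOpts = Map(
  "hoodie.table.name"                           -> "no_meta_fields_tbl",
  "hoodie.datasource.write.recordkey.field"     -> "uuid",
  "hoodie.datasource.write.precombine.field"    -> "ts",
  "hoodie.datasource.write.partitionpath.field" -> "partition",
  "hoodie.populate.meta.fields"                 -> "false" // no meta fields => no bloom filters
)

df.write.format("hudi").options(hudiOpts)
  .option("hoodie.datasource.write.operation", "insert")
  .mode(SaveMode.Overwrite).save(basePath)

// The upsert needs the bloom index, which cannot work without meta fields;
// today this surfaces as the NPE above instead of an actionable error.
df.write.format("hudi").options(hudiOpts)
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append).save(basePath)
{code}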
[jira] [Updated] (HUDI-4266) Flink streaming reader can not work when there are multiple partition fields
[ https://issues.apache.org/jira/browse/HUDI-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4266: Fix Version/s: 0.14.0 (was: 0.13.1) > Flink streaming reader can not work when there are multiple partition fields > > > Key: HUDI-4266 > URL: https://issues.apache.org/jira/browse/HUDI-4266 > Project: Apache Hudi > Issue Type: Bug > Components: flink-sql >Affects Versions: 0.11.0 >Reporter: Danny Chen >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4321) Fix Hudi to not write in Parquet legacy format
[ https://issues.apache.org/jira/browse/HUDI-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4321: Fix Version/s: 0.14.0 (was: 0.13.1) > Fix Hudi to not write in Parquet legacy format > -- > > Key: HUDI-4321 > URL: https://issues.apache.org/jira/browse/HUDI-4321 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.0 > > > Currently Hudi has to write in the Parquet legacy format > ("spark.sql.parquet.writeLegacyFormat") whenever the schema contains Decimals, > because it relies on AvroParquetReader, which is unable to read Decimals in the > non-legacy format (i.e. it can only read Decimals encoded as FIXED_BYTE_ARRAY, > not as INT32/INT64). > This leads to a suboptimal storage footprint; on some datasets, for example, > it could mean a bloat of 10% or more. -- This message was sent by Atlassian Jira (v8.20.10#820010)
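For context, a small sketch of the Spark-side switch in question (paths and schema are illustrative): any DecimalType column currently forces Hudi onto the legacy encoding.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).withColumn("price", expr("cast(id as decimal(10, 2))"))

// What Hudi effectively has to do while AvroParquetReader cannot decode
// INT32/INT64-backed decimals: write them as fixed-length byte arrays.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/legacy_decimals")

// The desired end state once the reader gap is closed: the compact encoding.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.mode("overwrite").parquet("/tmp/modern_decimals")
{code}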
[jira] [Updated] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"
[ https://issues.apache.org/jira/browse/HUDI-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4184: Fix Version/s: 0.14.0 (was: 0.13.1) > Creating external table in Spark SQL modifies "hoodie.properties" > - > > Key: HUDI-4184 > URL: https://issues.apache.org/jira/browse/HUDI-4184 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Alexey Kudinkin >Assignee: Sagar Sumit >Priority: Critical > Fix For: 0.14.0 > > > My setup was like the following: > # There's a table existing in one AWS account > # I'm trying to access that table from Spark SQL from _another_ AWS account > that only has Read permissions to the bucket with the table. > # Now, when issuing the "CREATE TABLE" Spark SQL command, it fails because Hudi tries > to modify the "hoodie.properties" file, even though I'm not modifying the table > and am just trying to create the table in the catalog. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4369) Hudi Kafka Connect Sink writing to GCS bucket
[ https://issues.apache.org/jira/browse/HUDI-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4369: Fix Version/s: 0.14.0 (was: 0.13.1) > Hudi Kafka Connect Sink writing to GCS bucket > - > > Key: HUDI-4369 > URL: https://issues.apache.org/jira/browse/HUDI-4369 > Project: Apache Hudi > Issue Type: Bug > Components: kafka-connect >Reporter: Vishal Agarwal >Priority: Critical > Fix For: 0.14.0 > > > Hi team, > I am trying to use Hudi sink connector with Kafka Connect to write to GCS > bucket. But I am getting error regarding "gs" file scheme. I have added all > GCS related properties in core-site.xml and the corresponding gcs-connector > jar in the plugin path. But still facing the issue. > The issue was already reported with S3 as per jira > https://issues.apache.org/jira/browse/HUDI-3610. But I am unable to get the > resolution. > Happy to discuss on this ! > Thanks > *StackTrace-* > %d [%thread] %-5level %logger - %msg%n > org.apache.hudi.exception.HoodieException: Fatal error instantiating Hudi > Write Provider > at > org.apache.hudi.connect.writers.KafkaConnectWriterProvider.(KafkaConnectWriterProvider.java:103) > ~[connectors-uber.jar:?] > at > org.apache.hudi.connect.transaction.ConnectTransactionParticipant.(ConnectTransactionParticipant.java:65) > ~[connectors-uber.jar:?] > at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:198) > [connectors-uber.jar:?] > at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) > [connectors-uber.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:587) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:67) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:652) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) > [kafka-clients-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:444) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:317) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224) > [connect-runtime-2.4.1.jar:?] > at > org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192) > [connect-runtime-2.4.1.jar:?] 
> at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) > [connect-runtime-2.4.1.jar:?] > at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) > [connect-runtime-2.4.1.jar:?] > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [?:1.8.0_331] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_331] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_331] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_331] > at java.lang.Thread.run(Thread.java:750) [?:1.8.0_331] > Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get > instance of org.apache.hadoop.fs.FileSystem > at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:109) > ~[connectors-uber.jar:?] > at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:100) > ~[connectors-uber.jar:?] > at org.apache.hudi.client.BaseHoodieClient.(BaseHoodieClient.java:69) > ~[connectors-uber.jar:?] > at > org.apache.hudi.client.BaseHoodieWriteClient.(BaseHoodieWriteClient.java:175) > ~[connectors-uber.jar:?] > a
[jira] [Updated] (HUDI-4341) HoodieHFileReader is not compatible with Hadoop 3
[ https://issues.apache.org/jira/browse/HUDI-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4341: Fix Version/s: 0.14.0 (was: 0.13.1) > HoodieHFileReader is not compatible with Hadoop 3 > - > > Key: HUDI-4341 > URL: https://issues.apache.org/jira/browse/HUDI-4341 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: spark > Fix For: 0.14.0 > > > [https://github.com/apache/hudi/issues/5765] > Spark SQL throws "java.lang.NoSuchMethodError: > org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" after > a while. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table
[ https://issues.apache.org/jira/browse/HUDI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4185: Fix Version/s: 0.14.0 (was: 0.13.1) > Evaluate alternatives to using "hoodie.properties" as state store for > Metadata Table > > > Key: HUDI-4185 > URL: https://issues.apache.org/jira/browse/HUDI-4185 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Alexey Kudinkin >Assignee: Sagar Sumit >Priority: Critical > Fix For: 0.14.0 > > > Currently the Metadata Table uses the "hoodie.properties" file as a state store, > adding properties reflecting the state of the metadata table being indexed. > This is creating some issues (for example, HUDI-4138) with respect to the > "hoodie.properties" lifecycle, as most of the existing code assumes > that the file is (mostly) immutable. > We should re-evaluate our usage of "hoodie.properties" as a state store given > that it has ripple effects on the existing components. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4306) ComplexKeyGenerator and ComplexAvroKeyGenerator support non-partitioned table
[ https://issues.apache.org/jira/browse/HUDI-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4306: Fix Version/s: 0.14.0 (was: 0.13.1) > ComplexKeyGenerator and ComplexAvroKeyGenerator support non-partitioned table > - > > Key: HUDI-4306 > URL: https://issues.apache.org/jira/browse/HUDI-4306 > Project: Apache Hudi > Issue Type: Bug > Components: flink-sql >Reporter: Danny Chen >Assignee: Nicholas Jiang >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3940: Fix Version/s: (was: 0.13.1) > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0, 0.12.1, 0.13.0, 0.12.3, 0.14.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3976) Newly introduced HiveSyncConfig config, syncAsSparkDataSourceTable is defaulted as true
[ https://issues.apache.org/jira/browse/HUDI-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3976: Fix Version/s: 0.14.0 (was: 0.13.1) > Newly introduced HiveSyncConfig config, syncAsSparkDataSourceTable is > defaulted as true > --- > > Key: HUDI-3976 > URL: https://issues.apache.org/jira/browse/HUDI-3976 > Project: Apache Hudi > Issue Type: Bug >Reporter: Surya Prasanna Yalla >Priority: Critical > Fix For: 0.14.0 > > > The newly introduced HiveSyncConfig option, syncAsSparkDataSourceTable, defaults > to true. With this config enabled, both tableProperties and serdeProperties are > added to the HMS. After that, spark.sql queries on the table fail with schema > mismatch errors. Also, this is not backward compatible: with version 0.8 we are > not able to read the table via Hive queries. -- This message was sent by Atlassian Jira (v8.20.10#820010)
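Until the default is revisited, the behavior can be opted out of per write; a sketch (df and the Hive sync endpoint are assumed to exist; other hive-sync options are elided):

{code:scala}
df.write.format("hudi")
  .option("hoodie.table.name", "hive_sync_demo")
  .option("hoodie.datasource.hive_sync.enable", "true")
  // Skip registering Spark datasource table/serde properties in the HMS,
  // keeping the synced table readable by older (e.g. 0.8-era) readers.
  .option("hoodie.datasource.hive_sync.sync_as_datasource", "false")
  .mode("append")
  .save("/tmp/hive_sync_demo")
{code}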
[jira] [Updated] (HUDI-4112) Relax constraint in metadata table that rollback of a commit that got archived in MDT throws exception
[ https://issues.apache.org/jira/browse/HUDI-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4112: Fix Version/s: 0.14.0 (was: 0.13.1) > Relax constraint in metadata table that rollback of a commit that got > archived in MDT throws exception > -- > > Key: HUDI-4112 > URL: https://issues.apache.org/jira/browse/HUDI-4112 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Fix For: 0.14.0 > > > When we try to roll back a commit that has already been archived in the MDT, > applying that rollback to the MDT throws an exception. > > Excerpt from HoodieTableMetadataUtil.java: > > {code:java} > HoodieInstant syncedInstant = new HoodieInstant(false, > HoodieTimeline.DELTA_COMMIT_ACTION, instantToRollback); > if > (metadataTableTimeline.getCommitsTimeline().isBeforeTimelineStarts(syncedInstant.getTimestamp())) > { > throw new HoodieMetadataException(String.format("The instant %s required to > sync rollback of %s has been archived", > syncedInstant, instantToRollback)); > } > shouldSkip = !metadataTableTimeline.containsInstant(syncedInstant); > if (!hasNonZeroRollbackLogFiles && shouldSkip) { > LOG.info(String.format("Skipping syncing of rollbackMetadata at %s, since > this instant was never committed to Metadata Table", > instantToRollback)); > return; > } {code} > > This is a perfectly valid scenario in case of a restore operation: > C1, C2, C3, C4, C5, C6. > C2 is savepointed. > But the MDT could have archived C2 (with aggressive archival settings) since C1 through > C6 are all committed. So, when we trigger a restore to C1, it will invoke rollbacks > of C6, C5... C2. > So, with the savepoint and restore flow, this is a valid scenario and we need to > relax the constraint. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3342) MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled
[ https://issues.apache.org/jira/browse/HUDI-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3342: Fix Version/s: 0.14.0 (was: 0.13.1) > MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled > --- > > Key: HUDI-3342 > URL: https://issues.apache.org/jira/browse/HUDI-3342 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Alexey Kudinkin >Assignee: Raymond Xu >Priority: Critical > Fix For: 0.14.0 > > > While working on HUDI-3322, I've spotted the following issue: > When we are rolling back Delta Commits, we add a corresponding > {{ROLLBACK_PREVIOUS_BLOCK}} Command Block at the back of the "queue". When we > restore, we issue a sequence of Rollbacks, which means that the stack of such > Rollback Blocks could be of size > 1. > However, when reading that MOR table, if the reader does not specify > `readBlocksLazily=true`, we merge Blocks eagerly (as instants > increment), essentially rendering such Rollback Blocks useless, since > they can't "unmerge" previously merged records, resurrecting the data that > was supposed to be rolled back. -- This message was sent by Atlassian Jira (v8.20.10#820010)
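For context, the switch lives on the log scanner builder; an abbreviated sketch against the 0.x API (the required builder fields are elided here and must be supplied for this to actually run):

{code:scala}
import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner

// Abbreviated sketch: file system, base path, log file paths, reader schema,
// latest instant time, etc. still need to be provided.
val scanner = HoodieMergedLogRecordScanner.newBuilder()
  // .withFileSystem(fs).withBasePath(basePath).withLogFilePaths(logFiles)
  // .withReaderSchema(schema).withLatestInstantTime(instantTime)
  .withReadBlocksLazily(true) // defer merging so ROLLBACK_PREVIOUS_BLOCK blocks stay effective
  .build()
{code}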
[jira] [Updated] (HUDI-4154) Unable to write HUDI Tables to S3 via Flink SQL
[ https://issues.apache.org/jira/browse/HUDI-4154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-4154: Fix Version/s: 0.14.0 (was: 0.13.1) > Unable to write HUDI Tables to S3 via Flink SQL > --- > > Key: HUDI-4154 > URL: https://issues.apache.org/jira/browse/HUDI-4154 > Project: Apache Hudi > Issue Type: Bug > Components: connectors >Reporter: sambhav gupta >Priority: Major > Fix For: 0.14.0 > > Attachments: Error_hudi.png, Flink-conf.yaml.png, > FlinkHudiTable_create.png, core-site.xml.png > > > When trying to write Hudi Tables into MinIO(S3) via Flink SQL we are facing > issues . > The configuration is as follows: > 1) MinIO S3 working on localhost:9000 - Latest docker image > 2) Flink 1.13.6 > 3) Hudi - hudi-flink-bundle_2.11-0.10.1.jar > 4) etc/core/site.xml set with S3 properties for access key, secret key and > endpoint already > When we create a MOR Hudi table as follows and try to insert records in it we > face an issue. > > create table t1s3hudi(id int PRIMARY KEY, name varchar(50)) with > > ('connector' = 'hudi', 'path' = 's3a://test123/t1s3hudi', 'table.type' = > > 'MERGE_ON_READ', 'hoodie.aws.access.key' = 'minioadmin', > > 'hoodie.aws.secret.key' = 'minioadmin'); > > insert into t1s3hudi values(1,'one number s3'); > > The exception that we get in error logs is: > *Caused by: org.apache.hudi.exception.HoodieException: Error while checking > whether table exists under path:s3a://test123/t1s3hudi* > *at org.apache.hudi.util.StreamerUtil.tableExists(StreamerUtil.java:292) > ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]* > *at > org.apache.hudi.util.StreamerUtil.initTableIfNotExists(StreamerUtil.java:258) > ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]* > *at > org.apache.hudi.sink.StreamWriteOperatorCoordinator.start(StreamWriteOperatorCoordinator.java:164) > ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]* > *at > org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:194) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:592) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605) > ~[flink-dist_2.11-1.13.6.jar:1.13.6]* > *... 
18 more* > *Caused by: java.nio.file.AccessDeniedException: > s3a://test123/t1s3hudi/.hoodie: getFileStatus on > s3a://test123/t1s3hudi/.hoodie: > com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon > S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: > XAJMZTMQDGHRWZS8; S3 Extended Request ID: > qaTd5xTZCvnRwThI9fTSeuWVuzXpuw9H6w7roFGBnBVNQmHe1O7mgHbzEZmEIKNp/bx3Iyb9/Kc=; > Proxy: null), S3 Extended Request ID: > qaTd5xTZCvnRwThI9fTSeuWVuzXpuw9H6w7roFGBnBVNQmHe1O7mgHbzEZmEIKNp/bx3Iyb9/Kc=:403 > Forbidden* > *at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:218) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:145) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2184) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at > org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2970) > ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]* > *at org.apache.hudi.util.StreamerUtil.tableExists(StreamerUtil.java:290) > ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]* > *at > org.apache.hudi.util.StreamerUtil.initTableIfNotExists(Stream
[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column
[ https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3646: Fix Version/s: 0.14.0 (was: 0.13.1) > The Hudi update syntax should not modify the nullability attribute of a column > -- > > Key: HUDI-3646 > URL: https://issues.apache.org/jira/browse/HUDI-3646 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.10.1 > Environment: spark3.1.2 >Reporter: Tao Meng >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.0 > > > Currently, when we use Spark SQL to update a Hudi table, we find that Hudi > changes the nullability attribute of a column. > E.g.: > {code:java} > // code placeholder > val tableName = generateTableName > val tablePath = s"${new Path(tmp.getCanonicalPath, > tableName).toUri.toString}" > // create table > spark.sql( >s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | location '$tablePath' > | options ( > | type = '$tableType', > | primaryKey = 'id', > | preCombineField = 'ts' > | ) > """.stripMargin) > // insert data to table > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000") > spark.sql(s"select * from $tableName").printSchema() > // update data > spark.sql(s"update $tableName set price = 20 where id = 1") > spark.sql(s"select * from $tableName").printSchema() {code} > > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- id: integer (nullable = true) > |-- name: string (nullable = true) > *|-- price: double (nullable = true)* > |-- ts: long (nullable = true) > > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- id: integer (nullable = true) > |-- name: string (nullable = true) > *|-- price: double (nullable = false)* > |-- ts: long (nullable = true) > > The nullable attribute of price has been changed to false, which is not the > result we want. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3786) how to deduce what MDT partitions to update on the write path w/ async indexing
[ https://issues.apache.org/jira/browse/HUDI-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3786: Fix Version/s: 0.14.0 (was: 0.13.1) > how to deduce what MDT partitions to update on the write path w/ async indexing > -- > > Key: HUDI-3786 > URL: https://issues.apache.org/jira/browse/HUDI-3786 > Project: Apache Hudi > Issue Type: Bug > Components: code-quality, metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Fix For: 0.14.0 > > > With async indexing, how do we deduce which MDT partitions to update on > the regular write path? > > {code:java} > private MetadataRecordsGenerationParams getRecordsGenerationParams() { > return new MetadataRecordsGenerationParams( > dataMetaClient, enabledPartitionTypes, > dataWriteConfig.getBloomFilterType(), > dataWriteConfig.getBloomIndexParallelism(), > dataWriteConfig.isMetadataColumnStatsIndexEnabled(), > dataWriteConfig.getColumnStatsIndexParallelism(), > > StringUtils.toList(dataWriteConfig.getColumnsEnabledForColumnStatsIndex()), > > StringUtils.toList(dataWriteConfig.getColumnsEnabledForBloomFilterIndex())); > } {code} > As of now, the above code snippet is what decides that. But don't we need > to decide based on tableConfig? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3683) Support evolved schema for HFile Reader
[ https://issues.apache.org/jira/browse/HUDI-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3683: Fix Version/s: 0.14.0 (was: 0.13.1) > Support evolved schema for HFile Reader > --- > > Key: HUDI-3683 > URL: https://issues.apache.org/jira/browse/HUDI-3683 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > {code:java} > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.057 > s <<< FAILURE! - in org.apache.hudi.io.storage.TestHoodieHFileReaderWriter > [ERROR] > org.apache.hudi.io.storage.TestHoodieHFileReaderWriter.testWriteReadWithEvolvedSchema > Time elapsed: 0.055 s <<< ERROR! > org.apache.avro.AvroTypeException: Found example.schema.trip, expecting > example.schema.trip, missing required field added_field > at > org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) > at > org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) > at > org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:137) > at > org.apache.hudi.io.storage.HoodieHFileReader.deserialize(HoodieHFileReader.java:394) > at > org.apache.hudi.io.storage.HoodieHFileReader.getRecordFromCell(HoodieHFileReader.java:378) > at > org.apache.hudi.io.storage.HoodieHFileReader.access$000(HoodieHFileReader.java:63) > at > org.apache.hudi.io.storage.HoodieHFileReader$2.hasNext(HoodieHFileReader.java:300) > at > org.apache.hudi.io.storage.TestHoodieReaderWriterBase.verifyReaderWithSchema(TestHoodieReaderWriterBase.java:231) > at > org.apache.hudi.io.storage.TestHoodieReaderWriterBase.testWriteReadWithEvolvedSchema(TestHoodieReaderWriterBase.java:153) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84) > at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137) > at > org.junit.jupiter.engine.descriptor.TestMethodTest
[jira] [Updated] (HUDI-3626) Refactor TableSchemaResolver to remove `includeMetadataFields` flags
[ https://issues.apache.org/jira/browse/HUDI-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3626: Fix Version/s: 0.14.0 (was: 0.13.1) > Refactor TableSchemaResolver to remove `includeMetadataFields` flags > > > Key: HUDI-3626 > URL: https://issues.apache.org/jira/browse/HUDI-3626 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.0 > > > Currently, `TableSchemaResolver` accepts an `includeMetadataFields` flag in its APIs > that selectively removes metadata fields from the returned schemas. > There are multiple issues with this flag: > # It's applied inconsistently: sometimes it just means that meta fields > _won't be added_, and sometimes it means fields _would be removed_ even > if present > # The flag leaks the usage context into TableSchemaResolver: whether the caller > wants to remove such meta fields, omit them, or take the schema as is is highly > contextual and should not be spilled into the Resolver itself. > # Because of it, there's no way to know what schema the data was actually > written with (the flag might not only omit, but also change the original > schema) -- This message was sent by Atlassian Jira (v8.20.10#820010)
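A sketch of the direction this implies: have the resolver always return the schema the data was actually written with, and make meta-field stripping an explicit caller-side helper (the helper below is hypothetical):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.hudi.common.model.HoodieRecord

// Hypothetical helper: rebuild an Avro record schema without Hudi meta fields.
// Avro fields cannot be reused across schemas, hence the fresh Field instances.
def removeMetadataFields(schema: Schema): Schema = {
  val fields = schema.getFields.asScala
    .filterNot(f => HoodieRecord.HOODIE_META_COLUMNS.contains(f.name()))
    .map(f => new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
    .asJava
  Schema.createRecord(schema.getName, schema.getDoc, schema.getNamespace, schema.isError, fields)
}
{code}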
[jira] [Updated] (HUDI-3603) Support read DateType for hive2/hive3
[ https://issues.apache.org/jira/browse/HUDI-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3603: Fix Version/s: 0.14.0 (was: 0.13.1) > Support read DateType for hive2/hive3 > --- > > Key: HUDI-3603 > URL: https://issues.apache.org/jira/browse/HUDI-3603 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Affects Versions: 0.10.1 >Reporter: Tao Meng >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Currently Hudi only supports reading DateType for hive2; we should support reading > DateType for both hive2 and hive3. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3639) [Incremental] Add Proper Incremental Records Filtering support into Hudi's custom RDD
[ https://issues.apache.org/jira/browse/HUDI-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3639: Fix Version/s: 0.14.0 (was: 0.13.1) > [Incremental] Add Proper Incremental Records Filtering support into Hudi's > custom RDD > - > > Key: HUDI-3639 > URL: https://issues.apache.org/jira/browse/HUDI-3639 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Currently, Hudi's `MergeOnReadIncrementalRelation` solely relies on > `ParquetFileReader` to do record-level filtering of the records that don't > belong to the timeline span being queried. > As a side-effect, Hudi actually has to disable the use of the > [VectorizedParquetReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-vectorized-parquet-reader.html] > (since using one would prevent records from being filtered by the Reader). > > Instead, we should make sure that proper record-level filtering is performed > within the returned RDD, rather than squarely relying on the FileReader to do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
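The record-level predicate itself is simple; a sketch of what the filtering amounts to when applied on top of the scan (so the vectorized reader can stay enabled), assuming an active SparkSession and illustrative paths/instants:

{code:scala}
import org.apache.spark.sql.functions.col

val beginTime = "20220301000000" // exclusive start of the queried span
val endTime   = "20220314000000" // inclusive end of the queried span

// Keep only records committed within (beginTime, endTime], using the standard
// Hudi meta column, instead of pushing the filter into ParquetFileReader.
val incremental = spark.read.format("hudi").load("/tmp/hudi_tbl")
  .filter(col("_hoodie_commit_time") > beginTime && col("_hoodie_commit_time") <= endTime)
{code}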
[jira] [Updated] (HUDI-3887) Spark query can not read the data changes which written by flink after the spark query connection created
[ https://issues.apache.org/jira/browse/HUDI-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3887: Fix Version/s: 0.14.0 (was: 0.13.1) > Spark query can not read the data changes which written by flink after the > spark query connection created > - > > Key: HUDI-3887 > URL: https://issues.apache.org/jira/browse/HUDI-3887 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: chacha.tang >Priority: Major > Fix For: 0.14.0 > > > Environment: > hudi version: 0.10.1 > flink version: 13.2 > spark version: 3.1.2 > > A Spark query cannot read data changes written by Flink after the > Spark query connection was created. > For example: > step1: use a Flink Hudi job to write the table > INSERT INTO t1 VALUES ('id1','Danny',20,TIMESTAMP '1970-01-01 > 00:00:01','par1'); > step2: create the Spark JDBC connection to query the data; at this time the data > can be queried correctly > step3: change the age property and write the data again > INSERT INTO t1 VALUES ('id1','Danny',27,TIMESTAMP '1970-01-01 > 00:00:01','par1'); > step4: use the Spark JDBC connection created in step2 to query the data, and > find that no change is visible > step5: create a new Spark JDBC connection to query the data; then the result > is correct -- This message was sent by Atlassian Jira (v8.20.10#820010)
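A likely mitigation on the Spark side is to invalidate the cached relation before re-querying, so the file index is rebuilt against the latest timeline; a sketch using the table name from the repro:

{code:scala}
// Drop Spark's cached relation/file listing so the next query re-lists the
// timeline and picks up commits written by the Flink job in the meantime.
spark.catalog.refreshTable("t1")

// SQL equivalent, e.g. issued over the long-lived JDBC connection:
spark.sql("refresh table t1")
spark.sql("select * from t1").show()
{code}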
[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure
[ https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3636: Fix Version/s: 0.14.0 (was: 0.13.1) > Clustering fails due to marker creation failure > --- > > Key: HUDI-3636 > URL: https://issues.apache.org/jira/browse/HUDI-3636 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Scenario: multi-writer test, one writer doing ingesting with Deltastreamer > continuous mode, COW, inserts, async clustering and cleaning (partitions > under 2022/1, 2022/2), another writer with Spark datasource doing backfills > to different partitions (2021/12). > 0.10.0 no MT, clustering instant is inflight (failing it in the middle before > upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before. > The clustering/replace instant cannot make progress due to marker creation > failure, failing the DS ingestion as well. Need to investigate if this is > timeline-server-based marker related or MT related. > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 > (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: > org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file > 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE > Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] > failed: Connection refused (Connection refused) > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121) > at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) > at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) > at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) > at scala.collection.AbstractIterator.to(Iterator.scala:1431) > at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) > at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) > at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) > at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) > at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030) > at > org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file > 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE > Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] > failed: Connection refused (Connection refused) > at > org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94) > at
[jira] [Updated] (HUDI-3668) Fix failing unit tests in hudi-integ-test
[ https://issues.apache.org/jira/browse/HUDI-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3668: Fix Version/s: 0.14.0 (was: 0.13.1) > Fix failing unit tests in hudi-integ-test > - > > Key: HUDI-3668 > URL: https://issues.apache.org/jira/browse/HUDI-3668 > Project: Apache Hudi > Issue Type: Bug > Components: tests-ci >Reporter: Ethan Guo >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.14.0 > > > org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter#testDFSTwoFilesWriteWithRollover > {code:java} > org.mockito.exceptions.verification.TooManyActualInvocations: > avroFileDeltaInputWriter.canWrite(); > Wanted 2 times: > -> at > org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter.testDFSTwoFilesWriteWithRollover(TestDFSHoodieTestSuiteWriterAdapter.java:119) > But was 3 times: > -> at > org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50) > -> at > org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50) > -> at > org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50) > at > org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter.testDFSTwoFilesWriteWithRollover(TestDFSHoodieTestSuiteWriterAdapter.java:119) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84) > at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208) > at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129) > at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:126) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTas
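For context on the mismatch above: Mockito's times(2) is a strict count, so a third canWrite() call (for example, one triggered by the rollover) fails the verification. A minimal sketch of the pattern, using a stand-in trait rather than Hudi's actual writer interface:

{code:scala}
// Minimal sketch of a strict-count verification, assuming a stand-in trait;
// the real mock in the test is avroFileDeltaInputWriter.
import org.mockito.Mockito.{mock, times, verify}

trait DeltaInputWriter {
  def canWrite: Boolean
}

val writer = mock(classOf[DeltaInputWriter])
writer.canWrite // adapter checks capacity before a write
writer.canWrite
writer.canWrite // an extra check on rollover ...

verify(writer, times(2)).canWrite // ... makes this throw TooManyActualInvocations
{code}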
[jira] [Updated] (HUDI-3818) hudi doesn't support bytes column as primary key
[ https://issues.apache.org/jira/browse/HUDI-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3818: Fix Version/s: 0.14.0 (was: 0.13.1) > hudi doesn't support bytes column as primary key > > > Key: HUDI-3818 > URL: https://issues.apache.org/jira/browse/HUDI-3818 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: rex xiong >Assignee: rex xiong >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > When a bytes column is used as the primary key, Hudi generates a fixed hoodie key, > so upserts will only ever insert one row. > {code:java} > scala> sql("desc extended binary_test1").show() > +--------------------+--------------------+-------+ > | col_name| data_type|comment| > +--------------------+--------------------+-------+ > | _hoodie_commit_time| string| null| > |_hoodie_commit_seqno| string| null| > | _hoodie_record_key| string| null| > |_hoodie_partition...| string| null| > | _hoodie_file_name| string| null| > | id| binary| null| > | name| string| null| > | dt| string| null| > | | | | > |# Detailed Table ...| | | > | Database| default| | > | Table| binary_test1| | > | Owner| root| | > | Created Time|Sat Apr 02 13:28:...| | > | Last Access| UNKNOWN| | > | Created By| Spark 3.2.0| | > | Type| MANAGED| | > | Provider| hudi| | > | Table Properties|[last_commit_time...| | > | Statistics| 435194 bytes| | > +--------------------+--------------------+-------+ > scala> sql("select * from binary_test1").show() > +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+---------+--------+ > |_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id| name| dt| > +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+---------+--------+ > | 20220402132927590|20220402132927590...|id:java.nio.HeapB...| |1a06106e-5e7a-4e6...|[03 45 6A 00 00 0...|Mary Jane|20220401| > +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+---------+--------+{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
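A minimal spark-shell sketch of the failure mode described above, assuming an illustrative table name, path, and rows; the key observation matches the output shown, where _hoodie_record_key renders the byte buffer's toString rather than its contents:

{code:scala}
// Hedged reproduction sketch (table name, path, and rows are illustrative).
import spark.implicits._

val df = Seq(
  (Array[Byte](1, 2, 3), "Mary Jane", "20220401"),
  (Array[Byte](4, 5, 6), "Peter Parker", "20220401")
).toDF("id", "name", "dt")

df.write.format("hudi").
  option("hoodie.table.name", "binary_test1").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "dt").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  mode("append").
  save("file:///tmp/binary_test1")

// _hoodie_record_key comes out like "id:java.nio.HeapByteBuffer[...]" for
// every row, so each upsert keeps overwriting the same single record.
{code}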
[jira] [Updated] (HUDI-3648) Failed to execute rollback due to HoodieIOException: Could not delete instant
[ https://issues.apache.org/jira/browse/HUDI-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3648: Fix Version/s: 0.14.0 (was: 0.13.1) > Failed to execute rollback due to HoodieIOException: Could not delete instant > - > > Key: HUDI-3648 > URL: https://issues.apache.org/jira/browse/HUDI-3648 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Critical > Labels: hudi-on-call > Fix For: 0.14.0 > > > Deltastreamer continuous mode writing to COW table with async clustering and > cleaning. > {code:java} > org.apache.hudi.exception.HoodieRollbackException: Failed to rollback > file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_b_single_writer_async_services/b3_ds_cow_010mt_011mt_conf/test_table > commits 20220314165647208 > at > org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:695) > at > org.apache.hudi.client.BaseHoodieWriteClient.rollbackFailedWrites(BaseHoodieWriteClient.java:1037) > at > org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1404) > at > org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1302) > at > org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:174) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:574) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:329) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:656) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieIOException: Could not delete > instant [==>20220314165647208__commit__INFLIGHT] > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.deleteInstantFile(HoodieActiveTimeline.java:250) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.deletePending(HoodieActiveTimeline.java:201) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.deleteInflightAndRequestedInstant(BaseRollbackActionExecutor.java:270) > at > org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor.executeRollback(CopyOnWriteRollbackActionExecutor.java:90) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:218) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:115) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.rollback(HoodieSparkCopyOnWriteTable.java:346) > at > org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:680) > ... 11 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3407) Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer scenario
[ https://issues.apache.org/jira/browse/HUDI-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3407: Fix Version/s: 0.14.0 (was: 0.13.1) > Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer > scenario > > > Key: HUDI-3407 > URL: https://issues.apache.org/jira/browse/HUDI-3407 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer >Reporter: Alexey Kudinkin >Priority: Major > Fix For: 0.14.0 > > > Currently there's no guard-rail that would prevent Restore from running > concurrently with Writes in a Multi-Writer scenario, which might lead to the table > getting into an inconsistent state. > > One of the approaches could be letting Restore acquire the Write lock for > the whole duration of its operation; see the sketch below. -- This message was sent by Atlassian Jira (v8.20.10#820010)
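A minimal sketch of the locking approach suggested above, with an illustrative lock interface rather than Hudi's actual LockProvider API:

{code:scala}
// Hedged sketch: hold the write lock for the entire restore so no concurrent
// writer can commit while the table is being rewound. Names are illustrative.
trait WriteLock {
  def lock(): Unit
  def unlock(): Unit
}

def restoreUnderLock(writeLock: WriteLock)(restore: => Unit): Unit = {
  writeLock.lock()
  try {
    restore // the whole restore runs while the lock is held
  } finally {
    writeLock.unlock()
  }
}
{code}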
[jira] [Updated] (HUDI-3487) The global index is enabled regardless of changelog
[ https://issues.apache.org/jira/browse/HUDI-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3487: Fix Version/s: 0.14.0 (was: 0.13.1) > The global index is enabled regardless of changelog > -- > > Key: HUDI-3487 > URL: https://issues.apache.org/jira/browse/HUDI-3487 > Project: Apache Hudi > Issue Type: Bug > Components: flink, index >Reporter: waywtdcc >Assignee: waywtdcc >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3467) Check shutdown logic with async compaction in Spark Structured Streaming
[ https://issues.apache.org/jira/browse/HUDI-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3467: Fix Version/s: 0.14.0 (was: 0.13.1) > Check shutdown logic with async compaction in Spark Structured Streaming > > > Key: HUDI-3467 > URL: https://issues.apache.org/jira/browse/HUDI-3467 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, spark >Reporter: Ethan Guo >Assignee: sivabalan narayanan >Priority: Critical > Labels: hudi-on-call > Fix For: 0.14.0 > > > Related issue > https://github.com/apache/hudi/issues/5046 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly
[ https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3517: Fix Version/s: 0.14.0 (was: 0.13.1) > Unicode in partition path causes it to be resolved wrongly > -- > > Key: HUDI-3517 > URL: https://issues.apache.org/jira/browse/HUDI-3517 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql, writer-core >Affects Versions: 0.10.1 >Reporter: Ji Qi >Assignee: Lokesh Jain >Priority: Blocker > Labels: hudi-on-call > Fix For: 0.14.0 > > Original Estimate: 2h > Remaining Estimate: 2h > > When there is unicode in the partition path, the upsert fails. > h3. To reproduce > # Create this dataframe in spark-shell (note the dotted I) > {code:none} > scala> res0.show(truncate=false) > +---+---+ > |_c0|_c1| > +---+---+ > |1 |İ | > +---+---+ > {code} > # Write it to hudi (this write will create the hudi table and succeed) > {code:none} > res0.write.format("hudi").option("hoodie.table.name", > "unicode_test").option("hoodie.datasource.write.precombine.field", > "_c0").option("hoodie.datasource.write.recordkey.field", > "_c0").option("hoodie.datasource.write.partitionpath.field", > "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test") > {code} > # Try to write {{res0}} again (this upsert will fail at index lookup stage) > Environment > * Hudi version: 0.10.1 > * Spark version: 3.1.2 > h3. Stacktrace > {code:none} > 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : > (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0&basepath=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test&fileid=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0&lastinstantts=20220225182311228&timelinehash=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83) > 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403) > org.apache.hudi.exception.HoodieIOException: Failed to read footer for > parquet > file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet > at > org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185) > at > org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201) > at > org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109) > at > org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49) > at > org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39) > at > org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149) > at > org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) > at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) > at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) > at scala.collection.AbstractIterator.to(Iterator.scala:1429) > at 
scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) > at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) > at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) > at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1429) > at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030) > at > org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
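The reproduction above shows res0 but not its construction; a minimal spark-shell sketch of step 1, under the assumption that any single-row frame with the dotted capital İ (U+0130) triggers the issue:

{code:scala}
// Hedged sketch of step 1: build res0 with a dotted capital İ in the column
// that will become the partition path.
import spark.implicits._

val res0 = Seq(("1", "İ")).toDF("_c0", "_c1")
res0.show(truncate = false)
{code}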
[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition
[ https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3300: Fix Version/s: 0.14.0 (was: 0.13.1) > Timeline server FSViewManager should avoid point lookup for metadata file > partition > --- > > Key: HUDI-3300 > URL: https://issues.apache.org/jira/browse/HUDI-3300 > Project: Apache Hudi > Issue Type: Bug > Components: metadata, timeline-server >Reporter: Manoj Govindassamy >Assignee: Yue Zhang >Priority: Major > Fix For: 0.14.0 > > > When inline reading is enabled, that is, > hoodie.metadata.enable.full.scan.log.files = false, > MetadataMergedLogRecordReader doesn't cache the file listing records via the > ExternalSpillableMap. So every file listing leads to re-reading of the > metadata files partition's log and base files. Since the files partition is small, > even when inline reading is enabled, the TimelineServer should > construct the FSViewManager with inline reading disabled for the metadata files > partition. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3067) "Table already exists" error with multiple writers and dynamodb
[ https://issues.apache.org/jira/browse/HUDI-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3067: Fix Version/s: 0.14.0 (was: 0.13.1) > "Table already exists" error with multiple writers and dynamodb > --- > > Key: HUDI-3067 > URL: https://issues.apache.org/jira/browse/HUDI-3067 > Project: Apache Hudi > Issue Type: Bug >Reporter: Nikita Sheremet >Assignee: Wenning Ding >Priority: Critical > Labels: hudi-on-call > Fix For: 0.14.0 > > > How to reproduce: > # Set up multiple writing per > [https://hudi.apache.org/docs/concurrency_control/] for DynamoDB (do not > forget to set _hoodie.write.lock.dynamodb.region_ and > {_}hoodie.write.lock.dynamodb.billing_mode{_}). Do not create any DynamoDB > table. > # Run multiple writers to the table > (Tested on AWS EMR, so the multiple writers are EMR steps) > Expected result: all steps complete. > Actual result: some steps fail with an exception > {code:java} > Caused by: com.amazonaws.services.dynamodbv2.model.ResourceInUseException: > Table already exists: truedata_detections (Service: AmazonDynamoDBv2; Status > Code: 400; Error Code: ResourceInUseException; Request ID:; Proxy: null) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1372) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) > at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) > at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) > at > com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6214) > at > com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6181) > at > com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeCreateTable(AmazonDynamoDBClient.java:1160) > at > com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.createTable(AmazonDynamoDBClient.java:1124) > at > org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.createLockTableInDynamoDB(DynamoDBBasedLockProvider.java:188) > at > org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:99) > at > org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:77) > ... 54 more > 21/12/19 13:42:06 INFO Yar {code} > This happens because all steps tried to create the table at the same time. > > Suggested solution: > A catch statement for the _Table already exists_ exception should be added to the > DynamoDB table creation code, possibly with a delay and an additional check that the > table is present; see the sketch below. -- This message was sent by Atlassian Jira (v8.20.10#820010)
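A minimal sketch of the suggested solution, using the AWS SDK v1 types visible in the stack trace; the verify-after-catch step is an assumption, not Hudi's actual patch:

{code:scala}
// Hedged sketch: treat ResourceInUseException as "another writer created the
// table first" rather than failing the whole job.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB
import com.amazonaws.services.dynamodbv2.model.{CreateTableRequest, DescribeTableRequest, ResourceInUseException}

def createLockTableIfAbsent(dynamoDb: AmazonDynamoDB, request: CreateTableRequest): Unit = {
  try {
    dynamoDb.createTable(request)
  } catch {
    case _: ResourceInUseException =>
      // Lost the creation race; confirm the table exists and carry on.
      dynamoDb.describeTable(new DescribeTableRequest().withTableName(request.getTableName))
  }
}
{code}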
[jira] [Updated] (HUDI-1748) Read operation may fail on MOR table RT view when a write operation is running concurrently
[ https://issues.apache.org/jira/browse/HUDI-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1748: Fix Version/s: 0.14.0 (was: 0.13.1) > Read operation may fail on MOR table RT view when a write > operation is running concurrently > > > Key: HUDI-1748 > URL: https://issues.apache.org/jira/browse/HUDI-1748 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer >Reporter: lrz >Priority: Major > Labels: core-flow-ds, pull-request-available, query-eng, > user-support-issues > Fix For: 0.14.0 > > > During a read operation, a new base file may be produced by a concurrent > write operation; the read may then hit an NPE in getSplit. Here > is the exception stack: > !https://wa.vision.huawei.com/vision-file-storage/api/file/download/upload-v2/2021/2/15/qwx352829/7bacca8042104499b0991d50b4bc3f2a/image.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3117) Kafka Connect can not clearly distinguish every task log
[ https://issues.apache.org/jira/browse/HUDI-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3117: Fix Version/s: 0.14.0 (was: 0.13.1) > Kafka Connect can not clearly distinguish every task log > > > Key: HUDI-3117 > URL: https://issues.apache.org/jira/browse/HUDI-3117 > Project: Apache Hudi > Issue Type: Bug >Reporter: cdmikechen >Assignee: Ethan Guo >Priority: Major > Labels: kafka-connect > Fix For: 0.14.0 > > > After creating multiple tasks in Kafka Connect, it is difficult to > tell from the logs which task a given entry belongs to, because the log > lines carry no task-identifying field. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3057) Instants should be generated strictly under locks
[ https://issues.apache.org/jira/browse/HUDI-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3057: Fix Version/s: 0.14.0 (was: 0.13.1) > Instants should be generated strictly under locks > - > > Key: HUDI-3057 > URL: https://issues.apache.org/jira/browse/HUDI-3057 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer, writer-core >Reporter: Alexey Kudinkin >Assignee: sivabalan narayanan >Priority: Major > Labels: sev:high > Fix For: 0.14.0 > > Attachments: logs.txt > > > While looking into the flakiness of the tests outlined here: > https://issues.apache.org/jira/browse/HUDI-3043 > > I've stumbled upon the following failure, where one of the writers tries to > complete the Commit but couldn't because such a file already exists: > {code:java} > java.util.concurrent.ExecutionException: java.lang.RuntimeException: > org.apache.hudi.exception.HoodieIOException: Failed to create file > /var/folders/kb/cnff55vj041g2nnlzs5ylqk0gn/T/junit5142536255031969586/testtable_MERGE_ON_READ/.hoodie/20211217150157632.commit > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:336) > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamerWithMultiWriter.java:150) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92) > at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > at > 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129) > at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execut
[jira] [Updated] (HUDI-3023) Fix order of tests
[ https://issues.apache.org/jira/browse/HUDI-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3023: Fix Version/s: 0.14.0 (was: 0.13.1) > Fix order of tests > -- > > Key: HUDI-3023 > URL: https://issues.apache.org/jira/browse/HUDI-3023 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 0.14.0 > > > Recently, we encountered an issue in integ tests where the namenode was not ready > yet to receive connections (still in safemode) and hdfs commands in the > ITTestHoodieDemo setup were not succeeding. The namenode typically takes > some time (2-3 minutes) to come up. While adding a delay is a workaround, it > would be better to execute this test after others like > ITTestHoodieSyncCommand and ITTestHoodieSanity. With JUnit 5.8 we can order > test classes, and as we already have a working patch for upgrading JUnit > (https://github.com/apache/hudi/pull/3748) we should consider fixing the > order to avoid flakiness; see the sketch below. -- This message was sent by Atlassian Jira (v8.20.10#820010)
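A minimal sketch of the JUnit 5.8 mechanism mentioned above; @Order on classes takes effect once junit.jupiter.testclass.order.default is set to org.junit.jupiter.api.ClassOrderer$OrderAnnotation in junit-platform.properties, and the class bodies here are placeholders:

{code:scala}
// Hedged sketch: order the integration test classes so the demo runs last,
// giving the namenode time to leave safemode. Bodies are placeholders.
import org.junit.jupiter.api.{Order, Test}

@Order(1)
class ITTestHoodieSanity {
  @Test def sanity(): Unit = ()
}

@Order(2)
class ITTestHoodieDemo {
  @Test def demo(): Unit = ()
}
{code}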
[jira] [Updated] (HUDI-3055) Make sure that Compression Codec configuration is respected across the board
[ https://issues.apache.org/jira/browse/HUDI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3055: Fix Version/s: 0.14.0 (was: 0.13.1) > Make sure that Compression Codec configuration is respected across the board > > > Key: HUDI-3055 > URL: https://issues.apache.org/jira/browse/HUDI-3055 > Project: Apache Hudi > Issue Type: Bug > Components: storage-management >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Major > Labels: new-to-hudi > Fix For: 0.14.0 > > > Currently there are quite a few places where we assume GZip as the > compression codec which is incorrect, given that this is configurable and > users might actually prefer to use different compression codec. > Examples: > [HoodieParquetDataBlock|https://github.com/apache/hudi/pull/4333/files#diff-798a773c6eef4011aef2da2b2fb71c25f753500548167b610021336ef6f14807] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column
[ https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1779: Fix Version/s: 0.14.0 (was: 0.13.1) > Fail to bootstrap/upsert a table which contains timestamp column > > > Key: HUDI-1779 > URL: https://issues.apache.org/jira/browse/HUDI-1779 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies, spark >Reporter: lrz >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png > > > Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file > which contains a timestamp column, it will fail because of these issues: > 1) At bootstrap, if the origin parquet file was written by a Spark > application, Spark will by default save timestamps as INT96 (see > spark.sql.parquet.int96AsTimestamp), and bootstrap will fail because > Hudi cannot read the INT96 type yet. (This can be solved by upgrading > parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; please check > https://github.com/apache/parquet-mr/pull/831/files and the sketch below.) > 2) After bootstrap, upserts will fail because we use the hoodie schema to > read the origin parquet file. The schemas do not match because the hoodie schema > treats timestamp as long while the origin file has INT96. > 3) After bootstrap, a partial update of a parquet file will fail, because > we copy the old record and save it with the hoodie schema (we miss a > convertFixedToLong operation like Spark does). -- This message was sent by Atlassian Jira (v8.20.10#820010)
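A minimal sketch of the workaround named in item 1, assuming Parquet 1.12.0+ is on the classpath; the flag name comes from the ticket itself:

{code:scala}
// Hedged sketch: ask parquet-avro to surface INT96 timestamps as fixed-size
// bytes instead of failing (requires parquet >= 1.12.0).
spark.sparkContext.hadoopConfiguration.set("parquet.avro.readInt96AsFixed", "true")
{code}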
[jira] [Updated] (HUDI-3114) Kafka Connect can not connect Hive by jdbc
[ https://issues.apache.org/jira/browse/HUDI-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3114: Fix Version/s: 0.14.0 (was: 0.13.1) > Kafka Connect can not connect Hive by jdbc > -- > > Key: HUDI-3114 > URL: https://issues.apache.org/jira/browse/HUDI-3114 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies, kafka-connect >Reporter: cdmikechen >Assignee: Ethan Guo >Priority: Critical > Fix For: 0.14.0 > > > Currently, Kafka Connect does not import the hive-jdbc dependency, which makes it > impossible to create Hive tables using Hive JDBC. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2930) Rollbacks are not archived when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-2930: Fix Version/s: 0.14.0 (was: 0.13.1) > Rollbacks are not archived when metadata table is enabled > - > > Key: HUDI-2930 > URL: https://issues.apache.org/jira/browse/HUDI-2930 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: HUDI-bug > Fix For: 0.14.0 > > > I ran bulk inserts into a COW table using DeltaStreamer continuous mode and > observed that the rollbacks are not archived. There were commits in between > these old rollbacks, but after the archival process kicks in, the old > rollbacks are still in the active timeline while the other commits are > archived. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3019) Upserts with datatype promotion only to a subset of partitions fail
[ https://issues.apache.org/jira/browse/HUDI-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-3019: Fix Version/s: 0.14.0 (was: 0.13.1) > Upserts with datatype promotion only to a subset of partitions fail > -- > > Key: HUDI-3019 > URL: https://issues.apache.org/jira/browse/HUDI-3019 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Affects Versions: 0.10.0 >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Critical > Fix For: 0.14.0 > > > Upserts with datatype promotion only to a subset of partitions fail. > > Let's say the initial insert was done to partition1 and partition2, with col1 typed > as integer. > commit2 inserted records into partition2 and partition3, with col1 typed as > long. integer -> long is a backwards-compatible evolution and hence the write > succeeds, but when trying to read data from Hudi, we run into issues. This is > not seen when a new column is added. A reproduction sketch follows this entry. > > Reference issue: > [https://github.com/apache/hudi/issues/3558] > > {code:java} > spark.sql("select * from hudi_trips_snapshot2").show() > 21/12/14 12:11:48 ERROR Executor: Exception in task 0.0 in stage 165.0 (TID > 1620) > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary > at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
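A minimal spark-shell sketch of the failure sequence, with illustrative names: commit 1 writes col1 as Int to partition1/partition2, commit 2 promotes col1 to Long across partition2/partition3, and the snapshot read then fails as shown above:

{code:scala}
// Hedged reproduction sketch (table name, path, and columns are illustrative).
import org.apache.spark.sql.DataFrame
import spark.implicits._

val basePath = "file:///tmp/promotion_test"

def write(df: DataFrame): Unit =
  df.write.format("hudi").
    option("hoodie.table.name", "promotion_test").
    option("hoodie.datasource.write.recordkey.field", "key").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.partitionpath.field", "part").
    mode("append").
    save(basePath)

// Commit 1: col1 is Int, partitions 1 and 2.
write(Seq(("k1", 1, 100L, "partition1"), ("k2", 2, 100L, "partition2"))
  .toDF("key", "col1", "ts", "part"))
// Commit 2: col1 promoted to Long, partitions 2 and 3 -- the write succeeds.
write(Seq(("k3", 3L, 200L, "partition2"), ("k4", 4L, 200L, "partition3"))
  .toDF("key", "col1", "ts", "part"))

// Reading partitions that still hold Int-typed parquet files then fails with
// the UnsupportedOperationException above.
spark.read.format("hudi").load(basePath).show()
{code}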
[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming
[ https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-2782: Fix Version/s: 0.14.0 (was: 0.13.1) > Fix marker based strategy for structured streaming > -- > > Key: HUDI-2782 > URL: https://issues.apache.org/jira/browse/HUDI-2782 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.14.0 > > > As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are > making timeline-server-based the default marker type. But we have an issue > w/ structured streaming. It looks like after the 1st micro batch, the timeline > server gets shut down, and for subsequent micro batches the timeline server is not > available. So, in the patch, we have overridden the marker type just for > structured streaming (see the sketch below). > > We may want to revisit this and see how to go about it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
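For reference, a hedged sketch of pinning the marker type for a structured streaming writer; the key hoodie.write.markers.type and the DIRECT value are taken from Hudi's marker configuration, not from the ticket, and the toy rate source stands in for a real stream:

{code:scala}
// Hedged sketch: force direct markers so micro batches after the first don't
// depend on a live timeline server. Paths and table name are illustrative.
val inputStream = spark.readStream.format("rate").load() // columns: timestamp, value

val query = inputStream.writeStream.
  format("hudi").
  option("hoodie.table.name", "stream_table").
  option("hoodie.datasource.write.recordkey.field", "value").
  option("hoodie.datasource.write.precombine.field", "timestamp").
  option("hoodie.write.markers.type", "DIRECT").
  option("checkpointLocation", "file:///tmp/checkpoints/stream_table").
  start("file:///tmp/hudi/stream_table")
{code}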
[jira] [Updated] (HUDI-2910) Hudi CLI "commits showarchived" throws NPE
[ https://issues.apache.org/jira/browse/HUDI-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-2910: Fix Version/s: 0.14.0 (was: 0.13.1) > Hudi CLI "commits showarchived" throws NPE > -- > > Key: HUDI-2910 > URL: https://issues.apache.org/jira/browse/HUDI-2910 > Project: Apache Hudi > Issue Type: Bug > Components: cli >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.0 > > > When trying to show archived commits through Hudi CLI command "commits > showarchived", NullPointerException is thrown. I'm using 0.10.0-rc2. > {code:java} > hudi:test_table->commits showarchived > Command failed java.lang.NullPointerException > java.lang.NullPointerException > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.lambda$readCommit$2(HoodieArchivedTimeline.java:154) > at org.apache.hudi.common.util.Option.map(Option.java:107) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.readCommit(HoodieArchivedTimeline.java:149) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.lambda$loadInstants$5(HoodieArchivedTimeline.java:228) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:230) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:193) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:189) > at > org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstantDetailsInMemory(HoodieArchivedTimeline.java:112) > at > org.apache.hudi.cli.commands.CommitsCommand.showArchivedCommits(CommitsCommand.java:217) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) > at > org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68) > at > org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59) > at > org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134) > at > org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) > at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) > at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2745) Record count does not match input after compaction is scheduled when running Hudi Kafka Connect sink
[ https://issues.apache.org/jira/browse/HUDI-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-2745: Fix Version/s: 0.14.0 (was: 0.13.1) > Record count does not match input after compaction is scheduled when running > Hudi Kafka Connect sink > > > Key: HUDI-2745 > URL: https://issues.apache.org/jira/browse/HUDI-2745 > Project: Apache Hudi > Issue Type: Bug > Components: compaction >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.0 > > > Spark Shell command to do snapshot query: > {code:java} > val basePath = "/tmp/hoodie/hudi-test-topic" > val df = spark.read.format("hudi").load(basePath) > df.createOrReplaceTempView("hudi_test_table") > spark.sql("select count(*) from hudi_test_table").show() {code} > Two cases of count mismatch: > (1) Compaction scheduled, more deltacommits later on: the count does not > match the input size. After compaction is executed, the count becomes correct. > (2) Clustering scheduled, more deltacommits later on: the count is correct, > equal to the input size. After clustering is executed, the count drops and > becomes incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2528) Flaky test: MERGE_ON_READ testTableOperationsWithRestore
[ https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-2528: Fix Version/s: 0.14.0 (was: 0.13.1) > Flaky test: MERGE_ON_READ testTableOperationsWithRestore > > > Key: HUDI-2528 > URL: https://issues.apache.org/jira/browse/HUDI-2528 > Project: Apache Hudi > Issue Type: Bug > Components: Testing, tests-ci >Reporter: Raymond Xu >Priority: Critical > Fix For: 0.14.0 > > > > {code:java} > [ERROR] Failures:[ERROR] There files should have been rolled-back when > rolling back commit 002 but are still remaining. Files: > [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet, > > file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet] > ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction > request available at 007 to run compaction {code} > > Probably the same cause as HUDI-2108 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1889) Support partition path in a nested field in HoodieFileIndex
[ https://issues.apache.org/jira/browse/HUDI-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1889: Fix Version/s: 0.14.0 (was: 0.13.1) > Support partition path in a nested field in HoodieFileIndex > --- > > Key: HUDI-1889 > URL: https://issues.apache.org/jira/browse/HUDI-1889 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.14.0 > > > Partition path in a nested field is not supported in HoodieFileIndex. When > using a nested field for the partition path, the following exception is > thrown: > {code:java} > java.lang.IllegalArgumentException: Cannot find column: 'fare.currency' in > the > schema[StructField(_row_key,StringType,true),StructField(timestamp,LongType,true),StructField(name,StringType,true),StructField(fare,StructType(StructField(value,LongType,true), > StructField(currency,StringType,true)),true)] > at > org.apache.hudi.HoodieFileIndex$$anonfun$4$$anonfun$apply$1.apply(HoodieFileIndex.scala:98) > at > org.apache.hudi.HoodieFileIndex$$anonfun$4$$anonfun$apply$1.apply(HoodieFileIndex.scala:98) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at > org.apache.hudi.HoodieFileIndex$$anonfun$4.apply(HoodieFileIndex.scala:98) > at > org.apache.hudi.HoodieFileIndex$$anonfun$4.apply(HoodieFileIndex.scala:97) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties$lzycompute(HoodieFileIndex.scala:97) > at > org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties(HoodieFileIndex.scala:91) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:245) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:147) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:116) > at > org.apache.hudi.TestHoodieRowWriting.testRowWriting(TestHoodieRowWriting.scala:103) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92) > at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > at > 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) >
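A minimal spark-shell sketch of the failing configuration, with the schema taken from the exception above and illustrative table/path names:

{code:scala}
// Hedged sketch: partitioning on the nested field fare.currency is what trips
// HoodieFileIndex on the read path.
import spark.implicits._

val df = Seq(("r1", 1L, "driver-a", 10L, "USD")).
  toDF("_row_key", "timestamp", "name", "value", "currency").
  selectExpr("_row_key", "timestamp", "name",
    "named_struct('value', value, 'currency', currency) as fare")

df.write.format("hudi").
  option("hoodie.table.name", "nested_partition_test").
  option("hoodie.datasource.write.recordkey.field", "_row_key").
  option("hoodie.datasource.write.precombine.field", "timestamp").
  option("hoodie.datasource.write.partitionpath.field", "fare.currency").
  mode("append").
  save("file:///tmp/nested_partition_test")

// Loading the table back goes through HoodieFileIndex and throws
// IllegalArgumentException: Cannot find column: 'fare.currency'.
spark.read.format("hudi").load("file:///tmp/nested_partition_test")
{code}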
[jira] [Updated] (HUDI-1380) Async cleaning does not work with Timeline Server
[ https://issues.apache.org/jira/browse/HUDI-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1380: Fix Version/s: 0.14.0 (was: 0.13.1) > Async cleaning does not work with Timeline Server > - > > Key: HUDI-1380 > URL: https://issues.apache.org/jira/browse/HUDI-1380 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core, table-service, timeline-server >Reporter: Nishith Agarwal >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
[ https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1369: Fix Version/s: 0.14.0 (was: 0.13.1) > Bootstrap datasource jobs from hanging via spark-submit > --- > > Key: HUDI-1369 > URL: https://issues.apache.org/jira/browse/HUDI-1369 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.0 > > > MOR table creation via the Hudi datasource hangs at the end of the spark-submit > job. > It looks like the {{HoodieWriteClient}} at > [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] > is not being closed, which means the timeline server is not stopped at the end; as > a result the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1117) Add tdunning json library to spark and utilities bundle
[ https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1117: Fix Version/s: 0.14.0 (was: 0.13.1) > Add tdunning json library to spark and utilities bundle > --- > > Key: HUDI-1117 > URL: https://issues.apache.org/jira/browse/HUDI-1117 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies, meta-sync >Affects Versions: 0.9.0 >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: user-support-issues > Fix For: 0.14.0 > > > Exception during Hive Sync: > ``` > An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: > org/json/JSONException\n\tat > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat > > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat > > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat > > org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat > > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat > org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat > org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat > org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat > org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat > org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat > org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat > > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat > > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat > > org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat > org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat > ``` > This is from using hudi-spark-bundle. > [https://github.com/apache/hudi/issues/1787] > The JSONException class comes from > https://mvnrepository.com/artifact/org.json/json There is a licensing issue, and > hence it is not part of the hudi bundle packages. The underlying issue is due to Hive > 1.x vs 2.x (see > https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20). > Spark Hive integration still brings in hive 1.x jars which depend on > org.json. I believe this was provided in the user's environment, and hence we have > not seen folks complaining about this issue. > Even though this is not a Hudi issue per se, let me check a jar with a compatible > license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it > works, we will add it to the 0.6 bundles after discussing with the community. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1158) Optimizations in parallelized listing behaviour for markers and bootstrap source files
[ https://issues.apache.org/jira/browse/HUDI-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1158: Fix Version/s: 0.14.0 (was: 0.13.1) > Optimizations in parallelized listing behaviour for markers and bootstrap > source files > -- > > Key: HUDI-1158 > URL: https://issues.apache.org/jira/browse/HUDI-1158 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Udit Mehrotra >Assignee: Ethan Guo >Priority: Critical > Fix For: 0.14.0 > > > * Extract out the common inner logic > * Parallelize not just at top directory level, but at the leaf partition > folders level -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
[ https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1036: Fix Version/s: 0.14.0 (was: 0.13.1) > HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit > --- > > Key: HUDI-1036 > URL: https://issues.apache.org/jira/browse/HUDI-1036 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Affects Versions: 0.9.0 >Reporter: Bhavani Sudha >Assignee: Nishith Agarwal >Priority: Major > Labels: user-support-issues > Fix For: 0.14.0 > > > Opening this Jira based on the GitHub issue reported here - > [https://github.com/apache/hudi/issues/1735] When hive.input.format = > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to > create a HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub > issue for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1145) Debug if Insert operation calls upsert in case of small file handling path.
[ https://issues.apache.org/jira/browse/HUDI-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1145: Fix Version/s: 0.14.0 (was: 0.13.1) > Debug if Insert operation calls upsert in case of small file handling path. > --- > > Key: HUDI-1145 > URL: https://issues.apache.org/jira/browse/HUDI-1145 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Affects Versions: 0.9.0 >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Major > Fix For: 0.14.0 > > > INSERT operations may be triggering UPSERT internally in the Merging process > when dealing with small files. This surfaced out of a Slack thread. Need to > confirm whether this is indeed happening. If yes, this needs to be fixed such > that the MERGE HANDLE does not call upsert and instead lets conflicting > records into the file if it is an INSERT operation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1286) Merge On Read queries (_rt) fails on docker demo for test suite
[ https://issues.apache.org/jira/browse/HUDI-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-1286: Fix Version/s: 0.14.0 (was: 0.13.1)
> Merge On Read queries (_rt) fail on docker demo for test suite
> ---
>
> Key: HUDI-1286
> URL: https://issues.apache.org/jira/browse/HUDI-1286
> Project: Apache Hudi
> Issue Type: Bug
> Components: dev-experience, Testing, tests-ci
> Affects Versions: 0.9.0
> Reporter: Nishith Agarwal
> Assignee: Nishith Agarwal
> Priority: Major
> Fix For: 0.14.0
>
>
> When running the following query:
> {code:java}
> select count(*) from testdb.table1_rt
> {code}
> we see the following exception in hiveserver:
> {code:java}
> 2020-09-16T03:41:07,668 INFO LocalJobRunner Map Task Executor #0: realtime.AbstractRealtimeRecordReader (AbstractRealtimeRecordReader.java:init(88)) - Writer Schema From Parquet => [_hoodie_commit_time type:UNION pos:0, _hoodie_commit_seqno type:UNION pos:1, _hoodie_record_key type:UNION pos:2, _hoodie_partition_path type:UNION pos:3, _hoodie_file_name type:UNION pos:4, timestamp type:LONG pos:5, _row_key type:STRING pos:6, rider type:STRING pos:7, driver type:STRING pos:8, begin_lat type:DOUBLE pos:9, begin_lon type:DOUBLE pos:10, end_lat type:DOUBLE pos:11, end_lon type:DOUBLE pos:12, fare type:DOUBLE pos:13]
> 2020-09-16T03:41:07,670 INFO [Thread-465]: mapred.LocalJobRunner (LocalJobRunner.java:runTasks(483)) - map task executor complete.
> 2020-09-16T03:41:07,671 WARN [Thread-465]: mapred.LocalJobRunner (LocalJobRunner.java:run(587)) - job_local242522391_0010
> java.lang.Exception: java.io.IOException: org.apache.hudi.exception.HoodieException: Error ordering fields for storage read. #fieldNames: 4, #fieldPositions: 5
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> Caused by: java.io.IOException: org.apache.hudi.exception.HoodieException: Error ordering fields for storage read. #fieldNames: 4, #fieldPositions: 5
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97) ~[hive-exec-2.3.3.jar:2.3.3]
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57) ~[hive-exec-2.3.3.jar:2.3.3]
>     at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379) ~[hive-exec-2.3.3.jar:2.3.3]
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
> Caused by: org.apache.hudi.exception.HoodieException: Error ordering fields for storage read. #fieldNames: 4, #fieldPositions: 5
>     at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.orderFields(HoodieRealtimeRecordReaderUtils.java:258) ~[hoodie-hadoop-mr-bundle.jar:0.6.1-SNAPSHOT]
>     at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:99) ~[hoodie-hadoop-mr-bundle.jar:0.6.1-SNAPSHOT]
>     at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(Abstra
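The exception above reduces to a length mismatch between the projected column names and the column positions handed to the realtime record reader (4 names vs. 5 positions). Below is a minimal sketch of that invariant, assuming both projections arrive as CSV strings from the job configuration; it is an illustrative reconstruction, not the actual HoodieRealtimeRecordReaderUtils.orderFields implementation:
{code:java}
import java.util.Arrays;
import java.util.List;

public class OrderFieldsSketch {
  // Reorders projected field names by their requested positions. Positions are
  // assumed here to index into the projected field list itself.
  public static List<String> orderFields(String fieldNameCsv, String fieldOrderCsv) {
    String[] fieldNames = fieldNameCsv.split(",");
    String[] fieldPositions = fieldOrderCsv.split(",");
    // The invariant whose violation produces the exception in the log above:
    if (fieldNames.length != fieldPositions.length) {
      throw new IllegalStateException(String.format(
          "Error ordering fields for storage read. #fieldNames: %d, #fieldPositions: %d",
          fieldNames.length, fieldPositions.length));
    }
    String[] ordered = new String[fieldNames.length];
    for (int i = 0; i < fieldNames.length; i++) {
      ordered[Integer.parseInt(fieldPositions[i].trim())] = fieldNames[i].trim();
    }
    return Arrays.asList(ordered);
  }
}
{code}
For example, orderFields("timestamp,_row_key,rider,driver", "0,1,2,3,4") reproduces the 4-vs-5 message from the stack trace; the question for this ticket is why Hive hands the reader one more position than names for the _rt table.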
[jira] [Updated] (HUDI-234) Graceful degradation of ObjectSizeCalculator for non-HotSpot JVMs
[ https://issues.apache.org/jira/browse/HUDI-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-234: --- Fix Version/s: 0.14.0 (was: 0.13.1) > Graceful degradation of ObjectSizeCalculator for non-HotSpot JVMs > - > > Key: HUDI-234 > URL: https://issues.apache.org/jira/browse/HUDI-234 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core > Affects Versions: 0.5.0 > Reporter: Vinoth Chandar > Priority: Major > Labels: new-to-hudi > Fix For: 0.14.0 > > > Bug report: https://github.com/apache/incubator-hudi/issues/860 -- This message was sent by Atlassian Jira (v8.20.10#820010)
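A minimal sketch of the graceful degradation being asked for, assuming ObjectSizeCalculator.getObjectSize throws UnsupportedOperationException on non-HotSpot JVMs and that a fixed per-object estimate is an acceptable fallback (the wrapper class and fallback constant below are hypothetical):
{code:java}
import org.apache.hudi.common.util.ObjectSizeCalculator;

public class GracefulSizeEstimator {
  // Hypothetical fallback: a fixed per-object estimate used on JVMs where the
  // HotSpot-specific memory layout probing is unavailable.
  private static final long FALLBACK_SIZE_BYTES = 1024L;
  private static volatile Boolean hotSpotSupported;

  public static long estimate(Object payload) {
    if (hotSpotSupported == null || hotSpotSupported) {
      try {
        long size = ObjectSizeCalculator.getObjectSize(payload);
        hotSpotSupported = Boolean.TRUE;
        return size;
      } catch (UnsupportedOperationException e) {
        // Thrown on non-HotSpot JVMs (e.g. OpenJ9); remember the answer and degrade.
        hotSpotSupported = Boolean.FALSE;
      }
    }
    return FALLBACK_SIZE_BYTES;
  }
}
{code}
Probing once and caching the result keeps the HotSpot fast path unchanged, while other JVMs degrade to an estimate instead of failing the write path.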
[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type
[ https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-992: --- Fix Version/s: 0.14.0 (was: 0.13.1) > For hive-style partitioned source data, partition columns synced with Hive > will always have String type > --- > > Key: HUDI-992 > URL: https://issues.apache.org/jira/browse/HUDI-992 > Project: Apache Hudi > Issue Type: Bug > Components: bootstrap, meta-sync > Affects Versions: 0.9.0 > Reporter: Udit Mehrotra > Assignee: Ethan Guo > Priority: Blocker > Fix For: 0.14.0 > > > Currently the bootstrap implementation is not able to handle partition columns > correctly when the source data has *hive-style partitioning*, as is also > mentioned in https://jira.apache.org/jira/browse/HUDI-915 > The schema inferred while performing bootstrap and stored in the commit > metadata does not include the partition column schema (in the case of hive-partitioned > data). As a result, during hive-sync, when Hudi tries to determine the type of the > partition column from that schema, it will not find it and will assume the > default data type *string*. > Here is where the partition column schema is determined for hive-sync: > [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417] > > Thus, no matter what the data type of the partition column is in the source data > (at least as Spark infers it from the path), it will always be synced as > string. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
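To make the failure mode concrete, here is a hypothetical sketch of the lookup described above (the class and map shape are illustrative, not the actual HiveSchemaUtil code): when the bootstrap-inferred schema has no entry for the hive-style partition column, the sync falls back to string.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PartitionTypeSketch {
  // Returns the Hive type to use for a partition column, defaulting to string
  // when the column is absent from the schema stored in the commit metadata.
  public static String partitionFieldType(Map<String, String> schema, String partitionField) {
    String type = schema.get(partitionField);
    return type == null ? "string" : type;
  }

  public static void main(String[] args) {
    Map<String, String> bootstrapSchema = new HashMap<>();
    bootstrapSchema.put("fare", "double"); // partition column "date" was never inferred
    // Prints "string" even if Spark read the path segment date=2020-09-16 as a date.
    System.out.println(partitionFieldType(bootstrapSchema, "date"));
  }
}
{code}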
[jira] [Updated] (HUDI-83) Map Timestamp type in Spark to the corresponding Timestamp type in Hive during Hive sync
[ https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yue Zhang updated HUDI-83: -- Fix Version/s: 0.14.0 (was: 0.13.1) > Map Timestamp type in Spark to the corresponding Timestamp type in Hive during > Hive sync > > > Key: HUDI-83 > URL: https://issues.apache.org/jira/browse/HUDI-83 > Project: Apache Hudi > Issue Type: Bug > Components: hive, meta-sync, Usability > Affects Versions: 0.9.0 > Reporter: Vinoth Chandar > Assignee: cdmikechen > Priority: Critical > Labels: pull-request-available, query-eng, sev:critical, > user-support-issues > Fix For: 0.14.0 > > > [https://github.com/apache/incubator-hudi/issues/543] & related issues -- This message was sent by Atlassian Jira (v8.20.10#820010)
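A sketch of the mapping this ticket asks for, operating on Avro field schemas as Hive sync does. The method below is illustrative (only the Avro API calls are real): a long carrying a timestamp logical type should surface as Hive timestamp instead of falling through to bigint.
{code:java}
import org.apache.avro.LogicalType;
import org.apache.avro.Schema;

public class TimestampMappingSketch {
  // Maps an Avro field schema to a Hive column type, honoring timestamp
  // logical types instead of letting them degrade to bigint.
  public static String toHiveType(Schema fieldSchema) {
    LogicalType logicalType = fieldSchema.getLogicalType();
    if (fieldSchema.getType() == Schema.Type.LONG && logicalType != null
        && logicalType.getName().startsWith("timestamp-")) {
      // Covers timestamp-millis and timestamp-micros: the behavior asked for here.
      return "timestamp";
    }
    switch (fieldSchema.getType()) {
      case LONG:   return "bigint"; // previous behavior, even for logical timestamps
      case STRING: return "string";
      case DOUBLE: return "double";
      default:
        throw new IllegalArgumentException("Unhandled in this sketch: " + fieldSchema.getType());
    }
  }
}
{code}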