[jira] [Updated] (HUDI-5602) Troubleshoot METADATA_ONLY bootstrapped table not being able to read back partition path

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5602:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Troubleshoot METADATA_ONLY bootstrapped table not being able to read back 
> partition path
> 
>
> Key: HUDI-5602
> URL: https://issues.apache.org/jira/browse/HUDI-5602
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> In [https://github.com/apache/hudi/pull/7461] after enabling matching of the 
> whole payload rather than just record counts, it's been discovered that Hudi 
> isn't able to read back partition-path after running METADATA_ONLY bootstrap, 
> leading to a test failure (it's annotated w/ the TODO and this Jira in the 
> test suite)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5608) Support decimals w/ precision > 30 in Column Stats

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5608:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support decimals w/ precision > 30 in Column Stats
> --
>
> Key: HUDI-5608
> URL: https://issues.apache.org/jira/browse/HUDI-5608
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> As reported in: [https://github.com/apache/hudi/issues/7732]
>  
> Currently we've capped the precision of supported decimals at 30, assuming 
> that this number is reasonably high to cover 99% of use-cases, but it seems 
> there's still demand for even larger Decimals.
> The challenge, however, is to balance the need to support longer Decimals against 
> the storage space we have to provision for each one of them.
>  
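> For a rough sense of the storage trade-off: the unscaled value of a decimal is 
> typically stored as a fixed-length byte array sized from its precision. A 
> minimal sketch of that sizing arithmetic (illustrative only, not Hudi's actual 
> column-stats code):
> {code:java}
> // Approximate bytes needed to hold an unscaled decimal of the given precision
> // as a signed, big-endian, fixed-length byte array (Parquet-style sizing).
> static int bytesForDecimalPrecision(int precision) {
>   double bitsNeeded = precision * (Math.log(10) / Math.log(2)) + 1; // +1 for the sign bit
>   return (int) Math.ceil(bitsNeeded / 8.0);
> }
> // bytesForDecimalPrecision(30) == 13, bytesForDecimalPrecision(38) == 16,
> // bytesForDecimalPrecision(50) == 21 -- every extra digit of precision costs
> // space in each min/max entry of the Column Stats index.
> {code}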



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5575:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support any record key generation along w/ any partition path generation for 
> row writer
> ---
>
> Key: HUDI-5575
> URL: https://issues.apache.org/jira/browse/HUDI-5575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> HUDI-5535 adds support for record key generation along w/ any partition path 
> generation. It also separates the record key generation and partition path 
> generation into separate interfaces.
> This jira aims to add similar support for the row writer path in spark.
> cc [~shivnarayan] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5574) Support auto record key generation with Spark SQL

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5574:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support auto record key generation with Spark SQL
> -
>
> Key: HUDI-5574
> URL: https://issues.apache.org/jira/browse/HUDI-5574
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: release-0.14.0-blocker
> Fix For: 0.14.0
>
>
> HUDI-2681 adds support for auto record key generation with spark dataframes. 
> This Jira aims to add support for the same with spark sql.
> One of the changes required here, as pointed out by [~kazdy], is that 
> SQL_INSERT_MODE would need to be handled: if SQL_INSERT_MODE is set to 
> strict, the insert should fail.
> cc [~shivnarayan] 
> Essentially, based on this patch 
> ([https://github.com/apache/hudi/pull/7681]), we want to ensure spark-sql 
> writes also support auto generation of record keys. 
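> For illustration, the target user experience would be along these lines (a 
> hedged sketch of the desired behavior, not the implemented syntax; table name 
> and location are made up):
> {code:java}
> // Table declared without a primaryKey tblproperty -> record keys should be auto-generated.
> spark.sql("create table hudi_keyless (id int, name string, ts long) using hudi "
>     + "location '/tmp/hudi_keyless'");
> // Should succeed in non-strict insert mode with auto-generated keys ...
> spark.sql("set hoodie.sql.insert.mode=non-strict");
> spark.sql("insert into hudi_keyless select 1, 'a1', 10");
> // ... and should fail fast when SQL_INSERT_MODE is strict, per the note above.
> spark.sql("set hoodie.sql.insert.mode=strict");
> spark.sql("insert into hudi_keyless select 2, 'a2', 20");
> {code}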



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5588) Fix Metadata table validator to deduce valid partitions when first commit where partition was added is failed

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5588:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix Metadata table validator to deduce valid partitions when first commit 
> where partition was added is failed
> -
>
> Key: HUDI-5588
> URL: https://issues.apache.org/jira/browse/HUDI-5588
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> Metadata validation sometimes fails due to a test code issue. 
> FS based listing shows 0 partitions, while MDT listing shows all 100 
> partitions. It's an issue w/ the validator code.
>  
> actual timeline:
> ls -ltr tbl1/hoodie_table/.hoodie/
> total 720
> drwxr-xr-x 2 nsb staff 64 Jan 17 18:45 archived
> drwxr-xr-x 4 nsb staff 128 Jan 17 18:45 metadata
> -rw-r--r-- 1 nsb staff 808 Jan 17 18:45 hoodie.properties
> -rw-r--r-- 1 nsb staff 1230 Jan 17 18:45 20230117214546000.rollback.requested
> -rw-r--r-- 1 nsb staff 0 Jan 17 18:45 20230117214546000.rollback.inflight
> -rw-r--r-- 1 nsb staff 1414 Jan 17 18:46 20230117214546000.rollback
> -rw-r--r-- 1 nsb staff 1230 Jan 17 18:47 20230117214701512.rollback.requested
> -rw-r--r-- 1 nsb staff 0 Jan 17 18:47 20230117214701512.rollback.inflight
> -rw-r--r-- 1 nsb staff 1414 Jan 17 18:47 20230117214701512.rollback
> -rw-r--r-- 1 nsb staff 15492 Jan 17 18:48 20230117214831503.rollback.requested
> -rw-r--r-- 1 nsb staff 0 Jan 17 18:48 20230117214831503.rollback.inflight
> -rw-r--r-- 1 nsb staff 0 Jan 17 18:48 20230117214848714.deltacommit.requested
> -rw-r--r-- 1 nsb staff 16359 Jan 17 18:48 20230117214831503.rollback
> -rw-r--r-- 1 nsb staff 69698 Jan 17 18:49 20230117214848714.deltacommit.inflight
> -rw-r--r-- 1 nsb staff 0 Jan 17 18:50 20230117215006714.deltacommit.requested
> -rw-r--r-- 1 nsb staff 94423 Jan 17 18:50 20230117214848714.deltacommit
> -rw-r--r-- 1 nsb staff 142198 Jan 17 18:50 20230117215006714.deltacommit.inflight
>  
>  
> At least there is one successful commit: 20230117214848714.deltacommit.
>  
> But our validator code checks the creation time of the partition and considers 
> it a valid partition only if that particular commit succeeded.
> {code:java}
> List<String> allPartitionPathsFromFS = 
> FSUtils.getAllPartitionPaths(engineContext, basePath, false, 
> cfg.assumeDatePartitioning);
> HoodieTimeline completedTimeline = 
> metaClient.getActiveTimeline().filterCompletedInstants();
> // ignore partitions created by uncommitted ingestion.
> allPartitionPathsFromFS = 
> allPartitionPathsFromFS.stream().parallel().filter(part -> {
>   HoodiePartitionMetadata hoodiePartitionMetadata =
>   new HoodiePartitionMetadata(metaClient.getFs(), 
> FSUtils.getPartitionPath(basePath, part));
> Option<String> instantOption = 
> hoodiePartitionMetadata.readPartitionCreatedCommitTime();
>   if (instantOption.isPresent()) {
> String instantTime = instantOption.get();
> return completedTimeline.containsOrBeforeTimelineStarts(instantTime);
>   } else {
> return false;
>   }
> }).collect(Collectors.toList()); {code}
>  
> We need to fix this.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5444:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>
> stacktrace
> {code:java}
> Caused by: java.io.FileNotFoundException: File not found: 
> gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
>         at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
>  {code}
>  
> 20221208133227028 (RB_C10)
> 20221208133227028001 MDT compaction
> 20221208132316380 (C10)
> 20221208133647230
> DT timeline (rollbacks):
> ║ 8  │ 20221202234515099 │ rollback │ COMPLETED │ Rolls back 2022120413756     │ 12-02 15:45:18 │ 12-02 15:45:18 │ 12-02 15:45:33 ║
> ║ 9  │ 20221208133227028 │ rollback │ COMPLETED │ Rolls back 20221208132316380 │ 12-08 05:32:33 │ 12-08 05:32:33 │ 12-08 05:32:44 ║
> ║ 10 │ 20221208133647230 │ rollback │ COMPLETED │ Rolls back 20221208133222583 │ 12-08 05:36:47 │ 12-08 05:36:48 │ 12-08 05:36:57 ║
> MDT timeline: 
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:32 
> 20221208133227028.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:32 
> 20221208133227028.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:32 20221208133227028.deltacommit
> -rw-r--r--@ 1 nsb  staff  1938 Dec  8 05:34 
> 20221208133227028001.compaction.requested
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 
> 20221208133227028001.compaction.inflight
> -rw-r--r--@ 1 nsb  staff  7556 Dec  8 05:34 20221208133227028001.commit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 
> 20221208132316380.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff  3049 Dec  8 05:34 
> 20221208132316380.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  8207 Dec  8 05:35 20221208132316380.deltacommit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:36 
> 20221208133647230.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:36 
> 20221208133647230.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:36 20221208133647230.deltacommit
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5507) SparkSQL can not read the latest change data without execute "refresh table xxx"

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5507:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> SparkSQL can not read the latest change data without execute "refresh table 
> xxx"
> 
>
> Key: HUDI-5507
> URL: https://issues.apache.org/jira/browse/HUDI-5507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.12.2
>Reporter: Danny Chen
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5557:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.2
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.3, 0.14.0
>
>
> Suppose the hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from index fields and the other 
> part comes from non-index fields, the candidate files queried from the 
> metadata table are wrong.
> For example following hudi table schema
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> sql
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5463:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> As of now, any rollback in DT is another DC in MDT. This may not scale for the 
> record level index in MDT since we have to add 1000s of delete records and 
> finally have to resolve all valid and invalid records. So, it's better to 
> roll back the commit in MDT as well instead of doing a DC. 
>  
> Impact: 
> The record level index is unusable w/o this change. While fixing other rollback 
> related tickets, do consider this as a possible option if it simplifies 
> other fixes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5442:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List<Path> queryPaths,
> Option<String> specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEn

[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5520:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fail MDT when list of log files grows unboundedly
> -
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5436) Auto repair tool for MDT out of sync

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5436:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Auto repair tool for MDT out of sync
> 
>
> Key: HUDI-5436
> URL: https://issues.apache.org/jira/browse/HUDI-5436
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> Can we write a spark-submit job to repair any out-of-sync issues w/ MDT? For 
> example, if MDT validation failed for a given table, we don't have a good way 
> to fix the MDT.
> So, we should develop a spark-submit job which will try to deduce from which 
> commit the out-of-sync happened and try to fix just the delta.
>  
> The idea here is:
> Try running the validation job for latest files at every commit, starting from 
> the latest, in reverse chronological order. At some point validation will 
> succeed. Let's call it commit N.
> We can add a savepoint to MDT at commit N and restore the table to that commit 
> N.
> Then we can take any new commits after commit N from the data table and apply 
> them one by one to MDT.
>  
> Once complete, we can run the validation tool again to ensure it's in good shape.
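> A rough outline of that repair flow (hedged pseudocode; the helper names below 
> are hypothetical, not an existing Hudi API):
> {code:java}
> // Hypothetical sketch of the proposed spark-submit repair job.
> String commitN = null;
> for (String instant : dataTableCommitsLatestFirst()) {    // reverse chronological order
>   if (validateLatestFilesAt(instant)) {                    // MDT vs FS listing match?
>     commitN = instant;                                     // latest consistent commit
>     break;
>   }
> }
> if (commitN != null) {
>   savepointMetadataTable(commitN);                         // savepoint MDT at commit N
>   restoreMetadataTableTo(commitN);                         // restore MDT to commit N
>   for (String instant : dataTableCommitsAfter(commitN)) {  // re-apply the delta, one by one
>     applyCommitToMetadataTable(instant);
>   }
>   runFullValidation();                                     // final sanity check
> }
> {code}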



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5374) Use KeyGeneratorFactory class for instantiating a KeyGenerator

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5374:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Use KeyGeneratorFactory class for instantiating a KeyGenerator
> --
>
> Key: HUDI-5374
> URL: https://issues.apache.org/jira/browse/HUDI-5374
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently the configs hoodie.datasource.write.keygenerator.class and 
> hoodie.datasource.write.keygenerator.type are used in multiple areas to 
> create a key generator. The idea is to reuse the *KeyGeneratorFactory classes 
> for instantiating a KeyGenerator.
> This Jira adds a KeyGeneratorFactory base class; HoodieSparkKeyGeneratorFactory 
> and HoodieAvroKeyGeneratorFactory extend this base class. These classes are 
> then used throughout the code for creating KeyGenerators.
> Based on Github issue: [https://github.com/apache/hudi/issues/7291]
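> For example, call sites would go through the factory instead of reflectively 
> instantiating the configured class themselves (a minimal sketch; assumes the 
> existing HoodieSparkKeyGeneratorFactory entry point):
> {code:java}
> import org.apache.hudi.common.config.TypedProperties;
> import org.apache.hudi.keygen.KeyGenerator;
> import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory;
> 
> TypedProperties props = new TypedProperties();
> props.setProperty("hoodie.datasource.write.recordkey.field", "id");
> props.setProperty("hoodie.datasource.write.partitionpath.field", "dt");
> // The factory resolves keygenerator.class / keygenerator.type and falls back to defaults.
> KeyGenerator keyGenerator = HoodieSparkKeyGeneratorFactory.createKeyGenerator(props);
> {code}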



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5271) Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5271:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
> 
>
> Key: HUDI-5271
> URL: https://issues.apache.org/jira/browse/HUDI-5271
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Exception detail in https://github.com/apache/hudi/issues/7284



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5364) Make sure Hudi's Column Stats are wired into Spark's relation stats

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5364:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure Hudi's Column Stats are wired into Spark's relation stats
> ---
>
> Key: HUDI-5364
> URL: https://issues.apache.org/jira/browse/HUDI-5364
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, we're leveraging CSI exclusively to better prune the target files.
> Additionally, we should wire in stats from CSI into Spark's 
> `CatalogStatistics` which in turn will be leveraged by Spark's Optimization 
> rules for better planning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5385) Make behavior of keeping File Writers open configurable

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5385:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make behavior of keeping File Writers open configurable
> ---
>
> Key: HUDI-5385
> URL: https://issues.apache.org/jira/browse/HUDI-5385
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, when writing in Spark we keep the File Writers for individual 
> partitions open as long as we're processing the batch, which entails that all 
> of the data written out is kept in memory (at least the last row-group in the 
> case of Parquet writers) until the batch is fully processed and all of the 
> writers are closed.
> While this allows us to better control how many files are created in every 
> partition (we keep the writer open and hence don't need to create a new file 
> when a new record comes in), it brings the penalty of keeping all of the data 
> in memory, potentially leading to OOMs, longer GC cycles, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5322) Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's schema

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5322:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's 
> schema
> 
>
> Key: HUDI-5322
> URL: https://issues.apache.org/jira/browse/HUDI-5322
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.14.0
>
>
> Row-writing Bulk-insert has to rewrite the incoming dataset into the finalized 
> Writer's schema; instead it's currently just using the incoming dataset as is, 
> deviating in semantics from the non-Row-writing flow (as well as other operations).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5405) Avoid using Projections in generic Merge Into DMLs

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5405:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Avoid using Projections in generic Merge Into DMLs
> --
>
> Key: HUDI-5405
> URL: https://issues.apache.org/jira/browse/HUDI-5405
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, `MergeIntoHoodieTableCommand` squarely relies on the semantics 
> implemented by `ExpressionPayload` to be able to insert/update records.
> While this is necessary since the MIT semantics enable users to do sophisticated 
> and fine-grained updates (for ex, partial updating), it is not necessary in 
> the most generic case:
>  
> {code:java}
> MERGE INTO target
> USING ... source
> ON target.id = source.id
> WHEN MATCHED THEN UPDATE *
> WHEN NOT MATCHED THEN INSERT *{code}
> This is essentially just a SQL way of implementing an upsert – if there are 
> matching records in the table we update them, otherwise – insert.
> In this case there's actually no need to use ExpressionPayload at all, and we 
> can just simply use normal Hudi upserting flow to handle it (avoiding all of 
> the ExpressionPayload overhead)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5361) Propagate Hudi properties set in Spark's SQLConf to Hudi

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5361:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Propagate Hudi properties set in Spark's SQLConf to Hudi
> 
>
> Key: HUDI-5361
> URL: https://issues.apache.org/jira/browse/HUDI-5361
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, the only property we propagate from Spark's SQLConf is 
> hoodie.metadata.enable.
> Instead, we should actually pull all of the Hudi-related configs from SQLConf 
> and pass them to Hudi.
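> A minimal sketch of what that propagation could look like (illustrative only; 
> assumes a SparkSession named spark and that filtering on the "hoodie." prefix 
> is sufficient):
> {code:java}
> // Collect all Hudi-related overrides from the session's SQLConf and overlay them
> // onto the write options before handing them to Hudi.
> Map<String, String> hudiOptsFromSqlConf = new HashMap<>();
> scala.collection.Iterator<scala.Tuple2<String, String>> it = spark.conf().getAll().iterator();
> while (it.hasNext()) {
>   scala.Tuple2<String, String> kv = it.next();
>   if (kv._1().startsWith("hoodie.")) {
>     hudiOptsFromSqlConf.put(kv._1(), kv._2());
>   }
> }
> {code}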



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5438) Benchmark calls w/ metadata enabled and ensure no calls to direct FS

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5438:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Benchmark calls w/ metadata enabled and ensure no calls to direct FS
> 
>
> Key: HUDI-5438
> URL: https://issues.apache.org/jira/browse/HUDI-5438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> We need to benchmark calls to S3 (via S3 access logs) and ensure that when 
> metadata is enabled, we don't make any direct calls to the FS. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5352:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests due 
> to Jackson not being able to serialize LocalDate as is, requiring an 
> additional JSR310 dependency.
>  
>  
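> A minimal sketch of the usual remedy (assumes the jackson-datatype-jsr310 
> artifact is on the classpath; this is not necessarily the fix Hudi will adopt):
> {code:java}
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
> 
> ObjectMapper mapper = new ObjectMapper();
> // Registers serializers/deserializers for java.time types such as LocalDate.
> mapper.registerModule(new JavaTimeModule());
> String json = mapper.writeValueAsString(java.time.LocalDate.of(2022, 12, 2));
> {code}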



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5319) NPE in Bloom Filter Index

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5319:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> NPE in Bloom Filter Index
> -
>
> Key: HUDI-5319
> URL: https://issues.apache.org/jira/browse/HUDI-5319
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> {code:java}
> /12/02 11:05:49 WARN TaskSetManager: Lost task 3.0 in stage 1098.0 (TID 
> 1300185) (ip-172-31-23-246.us-east-2.compute.internal executor 10): 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieIndexException: 
> Error checking bloom filter index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:138)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:87)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4944) The encoded slash (%2F) in partition path is not properly decoded during Spark read

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4944:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> The encoded slash (%2F) in partition path is not properly decoded during 
> Spark read
> ---
>
> Key: HUDI-4944
> URL: https://issues.apache.org/jira/browse/HUDI-4944
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: Untitled
>
>
> When the source partitioned parquet table of the bootstrap operation has an 
> encoded slash (%2F) in the partition path, e.g., 
> "partition_path=2015%2F03%2F17", and the metadata-only bootstrap stores the 
> data file path containing that encoded slash in the bootstrap index, the 
> target bootstrapped Hudi table cannot be read due to a FileNotFound exception. 
> The root cause is that the encoding of the slash is lost when creating the new 
> Path instance from the URI (see below: the path becomes 
> "partition_path=2015/03/17" instead of "partition_path=2015%2F03%2F17").
> {code:java}
> Caused by: java.io.FileNotFoundException: File does not exist: 
> hdfs://localhost:62738/user/ethan/test_dataset_bootstrapped/partition_path=2015/03/17/e0fa3466-d3bc-43f7-b586-2f95d8745095_3-161-675_01.parquet
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1528)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
>     at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1521)
>     at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
>     at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(Spark24HoodieParquetFileFormat.scala:131)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(Spark24HoodieParquetFileFormat.scala:130)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:134)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:111)
>     at 
> org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:71)
>     at 
> org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:70)
>     at org.apache.hudi.HoodieBootstrapRDD.compute(HoodieBootstrapRDD.scala:60)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) {code}
> The path conversion that causes the problem is in the code below.  "new 
> URI(file.filePath)" decodes the "%2F" and converts the slash.
> Spark24HoodieParquetFileFormat (same for Spark32PlusHoodieParquetFileFormat)
> {code:java}
> val fileSplit =
>   new FileSplit(new Path(new URI(file.filePath)), file.start, file.length, 
> Array.empty) {code}
> This fails the tests below and we need to use a partition path without 
> slashes in the value for now: 
> TestHoodieDeltaStreamer#testBulkInsertsAndUpsertsWithBootstrap
> ITTestHoodieDemo#testParquetDemo
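> The decoding behavior itself is easy to see with plain java.net.URI (a 
> standalone illustration of the root cause, not Hudi code):
> {code:java}
> String filePath =
>     "hdfs://localhost:62738/tbl/partition_path=2015%2F03%2F17/data.parquet";
> // URI#getPath() percent-decodes the path, so the encoded slashes become real separators:
> System.out.println(new java.net.URI(filePath).getPath());
> // prints: /tbl/partition_path=2015/03/17/data.parquet
> {code}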



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4937:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This is proving to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4738) [MOR] Bloom Index missing new records inserted into Log files

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4738:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> [MOR] Bloom Index missing new records inserted into Log files
> -
>
> Key: HUDI-4738
> URL: https://issues.apache.org/jira/browse/HUDI-4738
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Currently, Bloom Index is implemented under the following assumption: _a 
> file-group (once written) has a fixed set of records that cannot be 
> expanded_ (this is encoded t/h the assumption that at least one version of every 
> record w/in the file group is stored w/in its base file).
> This is relied upon when we tag incoming records w/ the locations of the 
> file-groups they could potentially belong to (in case such records are 
> updates), by fetching the Bloom Index info from either a) the base-file or b) 
> the record in the MT Bloom Index associated w/ a particular file-group id.
>  
> However, this assumption is not always true, since it's possible for _new_ 
> records to be inserted into the log-files, which would mean that the record 
> key-set of a single file-group could expand. This could potentially lead to 
> some records that were previously written to log-files being duplicated.
>  
> We need to reconcile these 2 aspects and do either of:
>  # Disallow expansion of the file-group records' set (by not allowing inserts 
> into log-files)
>  # Fix the Bloom Index implementation to also check log-files during tagging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5080) UnpersistRdds unpersist all rdds in the spark context

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5080:

Fix Version/s: (was: 0.13.1)

> UnpersistRdds unpersist all rdds in the spark context
> -
>
> Key: HUDI-5080
> URL: https://issues.apache.org/jira/browse/HUDI-5080
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> In SparkRDDWriteClient, we have a method to clean up persisted RDDs to free 
> up the space occupied. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java#L584]
> But the issue is that it cleans up all persisted RDDs in the given spark context. 
> This will impact async compaction or any other async table services running. 
> Even if there are multiple streams writing to different tables, this will 
> cause a huge impact. 
>  
> This also needs to be fixed in DeltaSync. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L345]
>  
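> One possible direction (a hedged sketch, not the actual fix): have the write 
> client track only the RDDs it persisted itself and unpersist just those, 
> instead of unpersisting everything in the SparkContext:
> {code:java}
> // Hypothetical tracking inside the write client / DeltaSync.
> private final List<JavaRDD<?>> persistedByThisClient = new ArrayList<>();
> 
> <T> JavaRDD<T> persistTracked(JavaRDD<T> rdd, StorageLevel level) {
>   rdd.persist(level);
>   persistedByThisClient.add(rdd);
>   return rdd;
> }
> 
> void unpersistOwnRdds() {
>   // Only touch RDDs this client persisted; other writers/table services are unaffected.
>   persistedByThisClient.forEach(JavaRDD::unpersist);
>   persistedByThisClient.clear();
> }
> {code}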



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4947) Missing .hoodie/hoodie.properties in Hudi table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4947:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Missing .hoodie/hoodie.properties in Hudi table
> ---
>
> Key: HUDI-4947
> URL: https://issues.apache.org/jira/browse/HUDI-4947
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> At some point, the ingestion job reports that hoodie.properties is missing 
> and neither hoodie.properties nor hoodie.properties.backup is present.  
> Sample stacktrace:
>  
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Could not load Hoodie 
> properties from s3://.../.hoodie/hoodie.properties
> at 
> org.apache.hudi.common.table.HoodieTableConfig.(HoodieTableConfig.java:254)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:125)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:78)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:668)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$getHoodieTableConfig$1(HoodieSparkSqlWriter.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.getHoodieTableConfig(HoodieSparkSqlWriter.scala:757)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:85)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:165) 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4922) Presto query of bootstrapped data returns null

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4922:

Fix Version/s: 0.14.0
   (was: 0.13.1)

>  Presto query of bootstrapped data returns null
> ---
>
> Key: HUDI-4922
> URL: https://issues.apache.org/jira/browse/HUDI-4922
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> https://github.com/apache/hudi/issues/6532



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5092) Querying Hudi table throws NoSuchMethodError in Databricks runtime

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5092:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Querying Hudi table throws NoSuchMethodError in Databricks runtime 
> ---
>
> Key: HUDI-5092
> URL: https://issues.apache.org/jira/browse/HUDI-5092
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: image (1).png, image.png
>
>
> Originally reported by the user: 
> [https://github.com/apache/hudi/issues/6137]
>  
> The crux of the issue is that Databricks's DBR runtime diverges from OSS Spark, 
> and in this case the `FileStatusCache` API is very clearly divergent b/w the two. 
> There are a few approaches we can take: 
>  # Avoid reliance on Spark's FileStatusCache implementation altogether and 
> rely on our own one
>  # Apply a more staggered approach where we first try to use Spark's 
> FileStatusCache and, if it doesn't match the expected API, fall back to our own 
> impl
>  
> Approach # 1 would actually mean that we're not sharing a cache implementation 
> w/ Spark, which in turn would entail that in some cases we might be keeping 2 
> instances of the same cache. Approach # 2 remediates that and allows us to 
> fall back only in case the API is not compatible. 
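> A rough sketch of approach #2 (all names below are hypothetical; the point is 
> the fallback-on-linkage-error shape, not a concrete API):
> {code:java}
> FileStatusCacheLike resolveFileStatusCache(SparkSession spark) {
>   try {
>     // Try to bind to Spark's shared FileStatusCache first.
>     return SparkFileStatusCacheAdapter.create(spark);
>   } catch (NoSuchMethodError | NoClassDefFoundError e) {
>     // DBR (or another divergent runtime) exposes an incompatible API;
>     // fall back to a Hudi-managed in-memory cache behind the same interface.
>     return new HoodieInMemoryFileStatusCache();
>   }
> }
> {code}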



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5015) Cleaner does not work properly when metadata table is enabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5015:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Cleaner does not work properly when metadata table is enabled
> -
>
> Key: HUDI-5015
> URL: https://issues.apache.org/jira/browse/HUDI-5015
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.14.0
>
>
> Please see [https://github.com/apache/hudi/pull/6926] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4958) Provide accurate numDeletes in commit metadata

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4958:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Provide accurate numDeletes in commit metadata
> --
>
> Key: HUDI-4958
> URL: https://issues.apache.org/jira/browse/HUDI-4958
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> When doing a simple computation of {{numInserts - numDeletes}} for all the 
> commits, this leads to negative total records.  Need to check if number of 
> inserts and deletes are accurate when both inserts and deletes exist in the 
> same input batch for upsert.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5229) Add flink avro version entry in root pom

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5229:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Add flink avro version entry in root pom
> 
>
> Key: HUDI-5229
> URL: https://issues.apache.org/jira/browse/HUDI-5229
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4777) Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4777:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink gen bucket index of mor table not consistent with spark lead to 
> duplicate bucket issue
> 
>
> Key: HUDI-4777
> URL: https://issues.apache.org/jira/browse/HUDI-4777
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4921:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just get the last 
> completed commit in the active timeline and set the value. 
> This needs fixing. 
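> A sketch of the intended computation (hedged; assumes the HoodieTimeline 
> helpers named below behave as their names suggest):
> {code:java}
> HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
> // Earliest commit that is still requested/inflight, if any.
> Option<HoodieInstant> earliestPending =
>     activeTimeline.filterInflightsAndRequested().firstInstant();
> HoodieTimeline completed = activeTimeline.filterCompletedInstants();
> // Last completed commit strictly before the earliest pending one; if nothing is
> // pending, the plain last completed commit is fine.
> Option<HoodieInstant> lastCompletedWithNoPendingBefore = earliestPending.isPresent()
>     ? completed.findInstantsBefore(earliestPending.get().getTimestamp()).lastInstant()
>     : completed.lastInstant();
> {code}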



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4854:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Deltastreamer does not respect partition selector regex for metadata-only 
> bootstrap
> ---
>
> Key: HUDI-4854
> URL: https://issues.apache.org/jira/browse/HUDI-4854
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4852) Incremental sync not updating pending file groups under clustering

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4852:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Incremental sync not updating pending file groups under clustering
> --
>
> Key: HUDI-4852
> URL: https://issues.apache.org/jira/browse/HUDI-4852
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Critical
> Fix For: 0.14.0
>
>
> Pending file groups under clustering are not updated through incremental sync 
> calls. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4818:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
> ---
>
> Key: HUDI-4818
> URL: https://issues.apache.org/jira/browse/HUDI-4818
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently using `CustomKeyGenerator` with the partition-path config 
> \{hoodie.datasource.write.partitionpath.field=ts:timestamp} fails w/
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to 
> `LongType` for partition column `ts_ms`
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
>  {code}
>  
> This occurs b/c SparkHoodieTableFileIndex produces an incorrect partition schema 
> at XXX, 
> where it properly handles only `TimestampBasedKeyGenerator`s but not the 
> other key-generators that might be changing the data-type of the 
> partition-value as compared to the source partition-column (in this case 
> `ts` is a long in the source table schema, but the partition-value is 
> produced as a string).
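> For reference, a write configuration that exercises this path (a hedged repro 
> sketch; the timestamp-based key generator settings that CustomKeyGenerator also 
> needs are omitted, and table name, fields and base path are made up):
> {code:java}
> df.write().format("hudi")
>     .option("hoodie.table.name", "tbl")
>     .option("hoodie.datasource.write.recordkey.field", "id")
>     .option("hoodie.datasource.write.keygenerator.class",
>         "org.apache.hudi.keygen.CustomKeyGenerator")
>     // field:type syntax from the description; `ts` is a long epoch column
>     .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
>     .mode("append")
>     .save(basePath);
> // Reading back through SparkHoodieTableFileIndex then fails while casting the
> // string partition value (e.g. "2022-05-11") to the source column's LongType.
> spark.read().format("hudi").load(basePath).show();
> {code}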



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4632) Remove the force active property for flink1.14 profile

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4632:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4643) MergeInto syntax WHEN MATCHED is optional but must be set

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4643:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> MergeInto syntax WHEN MATCHED is optional but must be set
> -
>
> Key: HUDI-4643
> URL: https://issues.apache.org/jira/browse/HUDI-4643
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
>  
> {code:java}
> spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long,
> | dt string
> |) using hudi
> | location '${tmp.getCanonicalPath}/$tableName'
> | tblproperties (
> | primaryKey ='id',
> | preCombineField = 'ts'
> | )
> """.stripMargin)
> // Insert data
> spark.sql(s"insert into $tableName select 1, 'a1', 1, 10, '2022-08-18'")
> spark.sql(
> s"""
> | merge into $tableName as t0
> | using (
> | select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt 
> union all
> | select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
> | ) as s0
> | on t0.id = s0.id
> | when not matched then insert *
> """.stripMargin
> )
> {code}
>  
> {code:java}
> 11493 [Executor task launch worker for task 65] ERROR 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor  - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: assertion 
> failed: hoodie.payload.update.condition.assignments have not set
>     at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:335)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:246)
>  
> {code}
>  
>  
> If hoodie.merge.allow.duplicate.on.inserts = true is set, the result contains 
> one more record than expected:
> [1,a1,1.0,10,2022-08-18], [1,a1,11.0,110,2022-08-19], 
> [2,a2,10.0,100,2022-08-18]
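>  
> A hedged workaround sketch (reusing the $tableName setup from the snippet 
> above; not a confirmed fix for this ticket): spelling out the optional WHEN 
> MATCHED clause avoids tripping the update-condition-assignments assertion.
> {code:java}
> spark.sql(
>   s"""
>      | merge into $tableName as t0
>      | using (
>      |   select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt union all
>      |   select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
>      | ) as s0
>      | on t0.id = s0.id
>      | when matched then update set *
>      | when not matched then insert *
>      """.stripMargin)
> {code}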



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4704:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.14.0
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> deletes the table and then recreates it, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}
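>  
> A hedged sketch of the read that breaks (the instant time and base path below 
> are placeholders, not values from this reproduction): a time-travel query back 
> to the first commit has nothing to resolve once the bulk-insert overwrite has 
> dropped and recreated the table.
> {code:java}
> // Scala sketch; "as.of.instant" is the Hudi datasource time-travel read option.
> spark.read.format("hudi")
>   .option("as.of.instant", "20211209000000000") // placeholder: first commit instant
>   .load("/path/to/hudi_cow_test_tbl")           // placeholder base path
>   .show()
> {code}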



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4542) Flink streaming query fails with ClassNotFoundException

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4542:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink streaming query fails with ClassNotFoundException
> ---
>
> Key: HUDI-4542
> URL: https://issues.apache.org/jira/browse/HUDI-4542
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
> Attachments: Screen Shot 2022-08-04 at 17.17.42.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy to EMR master node /lib/flink/lib
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Write a Hudi table with a few commits with metadata table enabled (no column 
> stats).  Then, run the following for the streaming query
> {code:java}
> CREATE TABLE t2(
>    uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>    name VARCHAR(10),
>    age INT,
>    ts TIMESTAMP(3),
>    `partition` VARCHAR(20)
>  )
>  PARTITIONED BY (`partition`)
>  WITH (
>    'connector' = 'hudi',
>    'path' = 's3a://',
>    'table.type' = 'MERGE_ON_READ',
>    'read.streaming.enabled' = 'true',  -- this option enable the streaming 
> read
>    'read.start-commit' = '20220803165232362', -- specifies the start commit 
> instant time
>    'read.streaming.check-interval' = '4' -- specifies the check interval for 
> finding new source commits, default 60s.
>  ); {code}
> {code:java}
> select * from t2; {code}
> {code:java}
> Flink SQL> select * from t2;
> 2022-08-05 00:12:43,635 INFO  org.apache.hadoop.metrics2.impl.MetricsConfig   
>              [] - Loaded properties from hadoop-metrics2.properties
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - Scheduled 
> Metric snapshot period at 300 second(s).
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - 
> s3a-file-system metrics system started
> 2022-08-05 00:12:47,722 INFO  org.apache.hadoop.fs.s3a.S3AInputStream         
>              [] - Switching to Random IO seek policy
> 2022-08-05 00:12:47,941 INFO  org.apache.hadoop.yarn.client.RMProxy           
>              [] - Connecting to ResourceManager at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:8032
> 2022-08-05 00:12:47,942 INFO  org.apache.hadoop.yarn.client.AHSProxy          
>              [] - Connecting to Application History server at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:10200
> 2022-08-05 00:12:47,942 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - No path for the flink jar passed. Using the location of 
> class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2022-08-05 00:12:47,942 WARN  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR 
> environment variable is set.The Flink YARN Client needs one of these to be 
> set to properly load the Hadoop configuration for accessing YARN.
> 2022-08-05 00:12:47,959 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Found Web Interface 
> ip-172-31-3-92.us-east-2.compute.internal:39605 of application 
> 'application_1659656614768_0001'.
> [ERROR] Could not execute SQL statement. Reason:
> java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat{code}
> {code:java}
> 2022-08-04 17:12:59
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> sun.reflect

[jira] [Updated] (HUDI-4573) Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4573:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode
> 
>
> Key: HUDI-4573
> URL: https://issues.apache.org/jira/browse/HUDI-4573
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4541) Flink job fails with column stats enabled in metadata table due to NotSerializableException

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4541:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink job fails with column stats enabled in metadata table due to 
> NotSerializableException

> 
>
> Key: HUDI-4541
> URL: https://issues.apache.org/jira/browse/HUDI-4541
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: Screen Shot 2022-08-04 at 17.10.05.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy to EMR master node /lib/flink/lib
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Run the following from the Flink quick start guide with metadata table, 
> column stats, and data skipping enabled
> {code:java}
> CREATE TABLE t1(
>   uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 's3a://',
>   'table.type' = 'MERGE_ON_READ', -- this creates a MERGE_ON_READ table, by 
> default is COPY_ON_WRITE
>   'metadata.enabled' = 'true', -- enables multi-modal index and metadata table
>   'hoodie.metadata.index.column.stats.enable' = 'true', -- enables column 
> stats in metadata table
>   'read.data.skipping.enabled' = 'true' -- enables data skipping
> );
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); {code}
> !Screen Shot 2022-08-04 at 17.10.05.png|width=1130,height=463!
> Exception:
> {code:java}
> 2022-08-04 17:04:41
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
>     at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
>     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>     at scala.PartialFunction$

[jira] [Updated] (HUDI-4457) Make sure IT docker test return code non-zero when failed

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4457:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure IT docker test return code non-zero when failed
> -
>
> Key: HUDI-4457
> URL: https://issues.apache.org/jira/browse/HUDI-4457
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.14.0
>
>
> An IT test case where the docker command runs and returns exit code 0, but the 
> test actually failed. This is misleading for troubleshooting.
> TODO:
> 1. Verify the behavior.
> 2. Fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4430) Incorrect type casting while reading HUDI table created with CustomKeyGenerator and unixtimestamp paritioning field

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4430:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Incorrect type casting while reading HUDI table created with 
> CustomKeyGenerator and unixtimestamp paritioning field
> ---
>
> Key: HUDI-4430
> URL: https://issues.apache.org/jira/browse/HUDI-4430
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.0
>Reporter: Volodymyr Burenin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Hi,
> I have discovered an issue that doesn't play nicely with custom key 
> generators, basically anything that is not TimestampBasedKeyGenerator or 
> TimestampBasedAvroKeyGenerator.
> {{While trying to read a table that was created with these parameters (the 
> rest don't matter):}}
> {code:java}
> hoodie.datasource.write.recordkey.field=query_id,event_type
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.partitionpath.field=create_time_epoch_seconds:timestamp
> hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd{code}
> {color:#172b4d}I get an error that looks like:{color}
> {code:java}
> 22/07/20 20:32:48 DEBUG Spark32HoodieParquetFileFormat: Appending 
> StructType(StructField(create_time_epoch_seconds,LongType,true)) [2022/07/13]
> 22/07/20 20:32:48 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Long
>     at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
>     at 
> org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:66)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
>  {code}
> Apparently the issue is in the _partitionSchemaFromProperties function in 
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala,
> which checks the key generator class type to decide when to use a StructType 
> of String for the partition column.
> Once the key generator is any known class that is not timestamp-based, it 
> basically keeps whatever type the source column has and then fails to 
> retrieve the partition value.
> I have a proposal here which we probably need: give the user a way to force a 
> string type if needed, and add the ability to add a prefixed column that 
> contains the processed partition value. It could be done as two separate features.
> This problem is critical for me, so I have to temporarily change the Hudi 
> source code on my end to make it work.
> Here is how I roughly changed the referenced function:
>  
> {code:java}
> /**
>  * Get the partition schema from the hoodie.properties.
>  */
> private lazy val _partitionSchemaFromProperties: StructType = {
>   val tableConfig = metaClient.getTableConfig
>   val partitionColumns = tableConfig.getPartitionFields
>   if (partitionColumns.isPresent) {
> val partitionFields = partitionColumns.get().map(column => 
> StructField("_hoodie_"+column, StringType))
> StructType(partitionFields)
>   } else {
> // If the partition columns have not been stored in hoodie.properties (the table 
> // that was created earlier), we treat it as a non-partitioned table.
> logWarning("No partition columns available from hoodie.properties." +
>   " Partition pruning will not work")
> new StructType()
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4539) Make Hudi's CLI API consistent

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4539:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make Hudi's CLI API consistent
> --
>
> Key: HUDI-4539
> URL: https://issues.apache.org/jira/browse/HUDI-4539
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently the API provided by the CLI is inconsistent:
>  # Some of the commands (to display metadata, for example) are applicable to some 
> commits/actions but not others
>  # The same actions should be applicable to both the active and the archived timeline 
> (from the CLI standpoint there should be essentially no difference)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4330) NPE when trying to upsert into a dataset with no Meta Fields

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4330:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> NPE when trying to upsert into a dataset with no Meta Fields
> 
>
> Key: HUDI-4330
> URL: https://issues.apache.org/jira/browse/HUDI-4330
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.14.0
>
>
> When trying to upsert into a dataset with meta fields disabled, you will 
> encounter an obscure NPE like the one below:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 25 in stage 20.0 failed 4 times, most recent failure: Lost task 25.3 in 
> stage 20.0 (TID 4110) (ip-172-31-20-53.us-west-2.compute.internal executor 
> 7): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:131)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:88)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}
> Instead, we could be more explicit as to why this could have happened 
> (meta-fields disabled -> no bloom filter created -> unable to do upserts)
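>  
> A hedged sketch of the kind of explicit guard suggested here (names are 
> illustrative, not actual Hudi internals): fail fast with a descriptive message 
> instead of surfacing the NPE from deep inside the bloom-filter lookup.
> {code:java}
> // Scala sketch of an up-front validation, assuming the caller can supply both flags.
> def validateUpsertPreconditions(populateMetaFields: Boolean, indexType: String): Unit = {
>   if (!populateMetaFields && indexType.contains("BLOOM")) {
>     throw new IllegalArgumentException(
>       "Upserts with a bloom-filter-based index require meta fields: meta fields are " +
>         "disabled, so no bloom filters were written to the base files.")
>   }
> }
> {code}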



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4266) Flink streaming reader can not work when there are multiple partition fields

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4266:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink streaming reader can not work when there are multiple partition fields
> 
>
> Key: HUDI-4266
> URL: https://issues.apache.org/jira/browse/HUDI-4266
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Affects Versions: 0.11.0
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4321) Fix Hudi to not write in Parquet legacy format

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4321:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix Hudi to not write in Parquet legacy format
> --
>
> Key: HUDI-4321
> URL: https://issues.apache.org/jira/browse/HUDI-4321
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently Hudi has to write in the Parquet legacy format 
> ("spark.sql.parquet.writeLegacyFormat") whenever the schema contains Decimals, 
> because it relies on AvroParquetReader, which is unable to read 
> Decimals in the non-legacy format (i.e. it can only read Decimals encoded as 
> FIXED_LEN_BYTE_ARRAY, not as INT32/INT64).
> This leads to a suboptimal storage footprint; on some datasets this 
> can cause a bloat of 10% or more.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4184:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Creating external table in Spark SQL modifies "hoodie.properties"
> -
>
> Key: HUDI-4184
> URL: https://issues.apache.org/jira/browse/HUDI-4184
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> My setup was like the following:
>  # There's a table existing in one AWS account
>  # I'm trying to access that table from Spark SQL from _another_ AWS account 
> that only has read permissions to the bucket with the table.
>  # Now, when issuing the "CREATE TABLE" Spark SQL command, it fails b/c Hudi tries 
> to modify the "hoodie.properties" file for whatever reason, even though I'm not 
> modifying the table and am just trying to create the table in the catalog.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4369) Hudi Kafka Connect Sink writing to GCS bucket

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4369:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Hudi Kafka Connect Sink writing to GCS bucket
> -
>
> Key: HUDI-4369
> URL: https://issues.apache.org/jira/browse/HUDI-4369
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: kafka-connect
>Reporter: Vishal Agarwal
>Priority: Critical
> Fix For: 0.14.0
>
>
> Hi team,
> I am trying to use the Hudi sink connector with Kafka Connect to write to a GCS 
> bucket, but I am getting an error regarding the "gs" file scheme. I have added all 
> GCS-related properties in core-site.xml and the corresponding gcs-connector 
> jar in the plugin path, but I am still facing the issue.
> The issue was already reported for S3 in 
> https://issues.apache.org/jira/browse/HUDI-3610, but I am unable to find a 
> resolution there.
> Happy to discuss this!
> Thanks
> *StackTrace-*
> %d [%thread] %-5level %logger - %msg%n 
> org.apache.hudi.exception.HoodieException: Fatal error instantiating Hudi 
> Write Provider 
>  at 
> org.apache.hudi.connect.writers.KafkaConnectWriterProvider.(KafkaConnectWriterProvider.java:103)
>  ~[connectors-uber.jar:?]
>  at 
> org.apache.hudi.connect.transaction.ConnectTransactionParticipant.(ConnectTransactionParticipant.java:65)
>  ~[connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:198) 
> [connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) 
> [connectors-uber.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:587)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:67)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:652)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231) 
> [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) 
> [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:444)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:317) 
> [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
>  [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) 
> [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) 
> [connect-runtime-2.4.1.jar:?]
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_331]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_331]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_331]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_331]
>  at java.lang.Thread.run(Thread.java:750) [?:1.8.0_331]
> Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get 
> instance of org.apache.hadoop.fs.FileSystem
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:109) 
> ~[connectors-uber.jar:?]
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:100) 
> ~[connectors-uber.jar:?]
>  at org.apache.hudi.client.BaseHoodieClient.(BaseHoodieClient.java:69) 
> ~[connectors-uber.jar:?]
>  at 
> org.apache.hudi.client.BaseHoodieWriteClient.(BaseHoodieWriteClient.java:175)
>  ~[connectors-uber.jar:?]
>  a

[jira] [Updated] (HUDI-4341) HoodieHFileReader is not compatible with Hadoop 3

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4341:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> HoodieHFileReader is not compatible with Hadoop 3
> -
>
> Key: HUDI-4341
> URL: https://issues.apache.org/jira/browse/HUDI-4341
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: spark
> Fix For: 0.14.0
>
>
> [https://github.com/apache/hudi/issues/5765]
> Spark SQL throws "java.lang.NoSuchMethodError: 
> org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" after 
> a while.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4185:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Evaluate alternatives to using "hoodie.properties" as state store for 
> Metadata Table
> 
>
> Key: HUDI-4185
> URL: https://issues.apache.org/jira/browse/HUDI-4185
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently the Metadata Table uses the "hoodie.properties" file as a state store, 
> adding properties reflecting the state of the metadata table being indexed.
> This is creating some issues (for example, HUDI-4138) with respect to the 
> "hoodie.properties" lifecycle, as most of the already existing code assumes 
> that the file is (mostly) immutable.
> We should re-evaluate our usage of "hoodie.properties" as a state store given 
> that it has ripple effects on the existing components.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4306) ComplexKeyGenerator and ComplexAvroKeyGenerator support non-partitioned table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4306:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> ComplexKeyGenerator and ComplexAvroKeyGenerator support non-partitioned table
> -
>
> Key: HUDI-4306
> URL: https://issues.apache.org/jira/browse/HUDI-4306
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Danny Chen
>Assignee: Nicholas Jiang
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3940:

Fix Version/s: (was: 0.13.1)

> Lock manager does not increment retry count upon exception
> --
>
> Key: HUDI-3940
> URL: https://issues.apache.org/jira/browse/HUDI-3940
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0, 0.12.1, 0.13.0, 0.12.3, 0.14.0
>
>
> Came up while debugging CI failure: 
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3976) Newly introduced HiveSyncConfig config, syncAsSparkDataSourceTable is defaulted as true

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3976:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Newly introduced HiveSyncConfig config, syncAsSparkDataSourceTable is 
> defaulted as true
> ---
>
> Key: HUDI-3976
> URL: https://issues.apache.org/jira/browse/HUDI-3976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Priority: Critical
> Fix For: 0.14.0
>
>
> The newly introduced HiveSyncConfig option, syncAsSparkDataSourceTable, 
> defaults to true. With this config enabled, both tableProperties and 
> serdeProperties are added to HMS. After that, spark.sql queries 
> on the table fail with schema mismatch errors. This is also 
> not backward compatible: using the 0.8 version we are not able to read the 
> table through Hive queries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4112) Relax constraint in metadata table that rollback of a commit that got archived in MDT throws exception

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4112:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Relax constraint in metadata table that rollback of a commit that got 
> archived in MDT throws exception
> --
>
> Key: HUDI-4112
> URL: https://issues.apache.org/jira/browse/HUDI-4112
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> When we are trying to roll back a commit, and the commit is archived in the 
> MDT, we throw an exception when this rollback is applied to the MDT. 
>  
> excerpt from HoodieTableMetadataUtil.java
>  
> {code:java}
> HoodieInstant syncedInstant = new HoodieInstant(false, 
> HoodieTimeline.DELTA_COMMIT_ACTION, instantToRollback);
> if 
> (metadataTableTimeline.getCommitsTimeline().isBeforeTimelineStarts(syncedInstant.getTimestamp()))
>  {
>   throw new HoodieMetadataException(String.format("The instant %s required to 
> sync rollback of %s has been archived",
>   syncedInstant, instantToRollback));
> }
> shouldSkip = !metadataTableTimeline.containsInstant(syncedInstant);
> if (!hasNonZeroRollbackLogFiles && shouldSkip) {
>   LOG.info(String.format("Skipping syncing of rollbackMetadata at %s, since 
> this instant was never committed to Metadata Table",
>   instantToRollback));
>   return;
> } {code}
>  
> This is very much a valid scenario for the restore operation. 
> Consider commits C1, C2, C3, C4, C5, C6, with C2 savepointed. 
> The MDT could have archived C2 (aggressive archival of commits) since C1 to 
> C6 are all committed. So, when we trigger a restore to C1, it will invoke 
> rollbacks of C6, C5, ..., C2. 
> So, with the savepoint and restore flow, this is a valid scenario and we need 
> to relax the constraint. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3342) MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3342:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled
> ---
>
> Key: HUDI-3342
> URL: https://issues.apache.org/jira/browse/HUDI-3342
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.14.0
>
>
> While working on HUDI-3322, I've spotted the following contraption:
> When we are rolling back Delta Commits, we add a corresponding 
> {{ROLLBACK_PREVIOUS_BLOCK}} Command Block at the back of the "queue". When we 
> restore, we issue a sequence of Rollbacks, which means that the stack of such 
> Rollback Blocks could be of size > 1.
> However, when reading that MOR table, if the reader does not specify 
> `readBlocksLazily=true`, we'd be merging Blocks eagerly (as instants 
> increment), essentially rendering such Rollback Blocks useless since 
> they can't "unmerge" previously merged records, resurrecting the data that 
> was supposed to be rolled back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4154) Unable to write HUDI Tables to S3 via Flink SQL

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4154:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Unable to write HUDI Tables to S3 via Flink SQL
> ---
>
> Key: HUDI-4154
> URL: https://issues.apache.org/jira/browse/HUDI-4154
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: connectors
>Reporter: sambhav gupta
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: Error_hudi.png, Flink-conf.yaml.png, 
> FlinkHudiTable_create.png, core-site.xml.png
>
>
> When trying to write Hudi tables into MinIO (S3) via Flink SQL, we are facing 
> issues.
> The configuration is as follows:
> 1) MinIO S3 working on localhost:9000 - Latest docker image
> 2) Flink 1.13.6 
> 3) Hudi - hudi-flink-bundle_2.11-0.10.1.jar
> 4) etc/core/site.xml set with S3 properties for access key, secret key and 
> endpoint already
> When we create a MOR Hudi table as follows and try to insert records in it we 
> face an issue.
> > create table t1s3hudi(id int PRIMARY KEY, name varchar(50)) with 
> > ('connector' = 'hudi', 'path' = 's3a://test123/t1s3hudi', 'table.type' = 
> > 'MERGE_ON_READ', 'hoodie.aws.access.key' = 'minioadmin', 
> > 'hoodie.aws.secret.key' = 'minioadmin');
> > insert into t1s3hudi values(1,'one number s3');
>  
> The exception that we get in error logs is: 
> *Caused by: org.apache.hudi.exception.HoodieException: Error while checking 
> whether table exists under path:s3a://test123/t1s3hudi*
>     *at org.apache.hudi.util.StreamerUtil.tableExists(StreamerUtil.java:292) 
> ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]*
>     *at 
> org.apache.hudi.util.StreamerUtil.initTableIfNotExists(StreamerUtil.java:258) 
> ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]*
>     *at 
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.start(StreamWriteOperatorCoordinator.java:164)
>  ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]*
>     *at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:194)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:592)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-dist_2.11-1.13.6.jar:1.13.6]*
>     *... 18 more*
> *Caused by: java.nio.file.AccessDeniedException: 
> s3a://test123/t1s3hudi/.hoodie: getFileStatus on 
> s3a://test123/t1s3hudi/.hoodie: 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> XAJMZTMQDGHRWZS8; S3 Extended Request ID: 
> qaTd5xTZCvnRwThI9fTSeuWVuzXpuw9H6w7roFGBnBVNQmHe1O7mgHbzEZmEIKNp/bx3Iyb9/Kc=; 
> Proxy: null), S3 Extended Request ID: 
> qaTd5xTZCvnRwThI9fTSeuWVuzXpuw9H6w7roFGBnBVNQmHe1O7mgHbzEZmEIKNp/bx3Iyb9/Kc=:403
>  Forbidden*
>     *at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:218) 
> ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:145) 
> ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2184)
>  ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
>  ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) 
> ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734) 
> ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2970) 
> ~[flink-s3-fs-hadoop-1.13.6.jar:1.13.6]*
>     *at org.apache.hudi.util.StreamerUtil.tableExists(StreamerUtil.java:290) 
> ~[hudi-flink-bundle_2.11-0.10.1.jar:0.10.1]*
>     *at 
> org.apache.hudi.util.StreamerUtil.initTableIfNotExists(Stream

[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3646:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> The Hudi update syntax should not modify the nullability attribute of a column
> --
>
> Key: HUDI-3646
> URL: https://issues.apache.org/jira/browse/HUDI-3646
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.10.1
> Environment: spark3.1.2
>Reporter: Tao Meng
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Now, when we use Spark SQL to update a Hudi table, we find that Hudi will 
> change the nullability attribute of a column.
> eg:
> {code:java}
> // code placeholder
>  val tableName = generateTableName
>  val tablePath = s"${new Path(tmp.getCanonicalPath, 
> tableName).toUri.toString}"
>  // create table
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long
>   |) using hudi
>   | location '$tablePath'
>   | options (
>   |  type = '$tableType',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  // insert data to table
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
>  spark.sql(s"select * from $tableName").printSchema()
>  // update data
>  spark.sql(s"update $tableName set price = 20 where id = 1")
>  spark.sql(s"select * from $tableName").printSchema() {code}
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = true)*
>  |-- ts: long (nullable = true)
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = false )*
>  |-- ts: long (nullable = true)
>  
> The nullable attribute of price has been changed to false; this is not the 
> result we want.
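>  
> A hedged way to surface the drift described above (assuming the same 
> $tableName as in the snippet): diff the schema's nullability before and after 
> the UPDATE.
> {code:java}
> val before = spark.sql(s"select * from $tableName").schema
> spark.sql(s"update $tableName set price = 20 where id = 1")
> val after = spark.sql(s"select * from $tableName").schema
> before.fields.zip(after.fields)
>   .filter { case (b, a) => b.nullable != a.nullable }
>   .foreach { case (b, a) => println(s"${b.name}: nullable ${b.nullable} -> ${a.nullable}") }
> {code}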



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3786) how to deduce what MDT partitions to update on the write path w/ async indexing

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3786:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> how to deduce what MDT partitions to update on the write path w/ async indexing
> --
>
> Key: HUDI-3786
> URL: https://issues.apache.org/jira/browse/HUDI-3786
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: code-quality, metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> With async indexing, how do we deduce which MDT partitions to update on 
> the regular write path? 
>  
> {code:java}
> private MetadataRecordsGenerationParams getRecordsGenerationParams() {
>   return new MetadataRecordsGenerationParams(
>   dataMetaClient, enabledPartitionTypes, 
> dataWriteConfig.getBloomFilterType(),
>   dataWriteConfig.getBloomIndexParallelism(),
>   dataWriteConfig.isMetadataColumnStatsIndexEnabled(),
>   dataWriteConfig.getColumnStatsIndexParallelism(),
>   
> StringUtils.toList(dataWriteConfig.getColumnsEnabledForColumnStatsIndex()),
>   
> StringUtils.toList(dataWriteConfig.getColumnsEnabledForBloomFilterIndex()));
> } {code}
> As of now, I see the above code snippet is what decides that. But don't we need 
> to decide based on tableConfig? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3683) Support evolved schema for HFile Reader

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3683:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support evolved schema for HFile Reader
> ---
>
> Key: HUDI-3683
> URL: https://issues.apache.org/jira/browse/HUDI-3683
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.057 
> s <<< FAILURE! - in org.apache.hudi.io.storage.TestHoodieHFileReaderWriter
> [ERROR] 
> org.apache.hudi.io.storage.TestHoodieHFileReaderWriter.testWriteReadWithEvolvedSchema
>   Time elapsed: 0.055 s  <<< ERROR!
> org.apache.avro.AvroTypeException: Found example.schema.trip, expecting 
> example.schema.trip, missing required field added_field
>   at 
> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>   at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>   at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
>   at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>   at 
> org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:137)
>   at 
> org.apache.hudi.io.storage.HoodieHFileReader.deserialize(HoodieHFileReader.java:394)
>   at 
> org.apache.hudi.io.storage.HoodieHFileReader.getRecordFromCell(HoodieHFileReader.java:378)
>   at 
> org.apache.hudi.io.storage.HoodieHFileReader.access$000(HoodieHFileReader.java:63)
>   at 
> org.apache.hudi.io.storage.HoodieHFileReader$2.hasNext(HoodieHFileReader.java:300)
>   at 
> org.apache.hudi.io.storage.TestHoodieReaderWriterBase.verifyReaderWithSchema(TestHoodieReaderWriterBase.java:231)
>   at 
> org.apache.hudi.io.storage.TestHoodieReaderWriterBase.testWriteReadWithEvolvedSchema(TestHoodieReaderWriterBase.java:153)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>   at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
>   at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTest

[jira] [Updated] (HUDI-3626) Refactor TableSchemaResolver to remove `includeMetadataFields` flags

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3626:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Refactor TableSchemaResolver to remove `includeMetadataFields` flags
> 
>
> Key: HUDI-3626
> URL: https://issues.apache.org/jira/browse/HUDI-3626
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, `TableSchemaResolver` accepts an `includeMetadataFields` flag in its APIs 
> that selectively removes metadata fields from the returned schemas. 
> There are multiple issues with this flag:
>  # It's applied inconsistently: sometimes it just means that meta fields 
> {_}won't be added{_}, and sometimes it means fields _would be removed_ even 
> if present
>  # The flag spells the usage context into TableSchemaResolver: whether the 
> caller wants to remove or omit such meta-fields, or take the schema as is, is 
> highly contextual and should not be spilled into the Resolver itself
>  # Because of it there's no way to know the actual schema the data 
> was written with (the flag might not only omit, but also change the original 
> schema)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3603) Support read DateType for hive2/hive3

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3603:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support read  DateType  for hive2/hive3
> ---
>
> Key: HUDI-3603
> URL: https://issues.apache.org/jira/browse/HUDI-3603
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Affects Versions: 0.10.1
>Reporter: Tao Meng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Now Hudi only supports reading DateType for Hive 2; we should support reading 
> DateType for both Hive 2 and Hive 3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3639) [Incremental] Add Proper Incremental Records FIltering support into Hudi's custom RDD

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3639:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> [Incremental] Add Proper Incremental Records FIltering support into Hudi's 
> custom RDD
> -
>
> Key: HUDI-3639
> URL: https://issues.apache.org/jira/browse/HUDI-3639
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, Hudi's `MergeOnReadIncrementalRelation` solely relies on 
> `ParquetFileReader` to do record-level filtering of the records that don't 
> belong to a timeline span being queried.
> As a side-effect, Hudi actually has to disable the use of 
> [VectorizedParquetReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-vectorized-parquet-reader.html]
>  (since using one would prevent records from being filtered by the Reader)
>  
> Instead, we should make sure that proper record-level filtering is performed 
> w/in the returned RDD, instead of squarely relying on FileReader to do that.
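>  
> As a hedged sketch (illustrative only, not the actual relation code), the 
> record-level filtering amounts to constraining rows by `_hoodie_commit_time` 
> within the returned dataset, rather than depending on the Parquet reader.
> {code:java}
> import org.apache.spark.sql.functions.col
> 
> val beginInstant = "20220301000000" // placeholder begin instant
> val endInstant   = "20220315000000" // placeholder end instant
> val filtered = spark.read.format("hudi").load("/path/to/table") // placeholder path
>   .filter(col("_hoodie_commit_time") > beginInstant &&
>     col("_hoodie_commit_time") <= endInstant)
> {code}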



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3887) Spark query can not read the data changes which written by flink after the spark query connection created

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3887:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Spark query can not read data changes written by Flink after the Spark query 
> connection is created
> -
>
> Key: HUDI-3887
> URL: https://issues.apache.org/jira/browse/HUDI-3887
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: chacha.tang
>Priority: Major
> Fix For: 0.14.0
>
>
> Environment:
> hudi version: 0.10.1
> flink version: 13.2
> spark version: 3.1.2
>  
> A Spark query cannot read data changes written by Flink after the Spark query 
> connection has been created.
> For example:
> step1: use a Flink Hudi job to write to the table
> INSERT INTO t1 VALUES ('id1','Danny',20,TIMESTAMP '1970-01-01 
> 00:00:01','par1');
> step2: create the Spark JDBC connection to query the data; at this point the 
> data can be queried correctly
> step3: change the age property and write the data again.
> INSERT INTO t1 VALUES ('id1','Danny',27,TIMESTAMP '1970-01-01 
> 00:00:01','par1');
> step4: use the Spark JDBC connection created in step2 to query the data, and 
> find that no change is visible.
> step5: create a new Spark JDBC connection to query the data; then the result 
> is correct
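>  
> If the stale reads in step4 are caused by the long-lived session caching the 
> resolved relation and file listing (an assumption, not confirmed in this 
> report), refreshing the table in that same session before re-querying is a 
> common workaround; a minimal sketch:
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> public class RefreshBeforeQuery {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder().appName("refresh-sketch").getOrCreate();
>     // Invalidate any cached metadata for t1 held by this session, then re-read.
>     spark.sql("REFRESH TABLE t1");
>     spark.sql("SELECT * FROM t1").show();
>   }
> }
> {code}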



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3636:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Scenario: multi-writer test, one writer doing ingestion with Deltastreamer in 
> continuous mode, COW, inserts, async clustering and cleaning (partitions 
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills 
> to different partitions (2021/12).  
> 0.10.0 no MT, clustering instant is inflight (failing it in the middle before 
> upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  Need to investigate if this is 
> timeline-server-based marker related or MT related.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
>     at 

[jira] [Updated] (HUDI-3668) Fix failing unit tests in hudi-integ-test

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3668:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix failing unit tests in hudi-integ-test
> -
>
> Key: HUDI-3668
> URL: https://issues.apache.org/jira/browse/HUDI-3668
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.14.0
>
>
> org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter#testDFSTwoFilesWriteWithRollover
> {code:java}
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> avroFileDeltaInputWriter.canWrite();
> Wanted 2 times:
> -> at 
> org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter.testDFSTwoFilesWriteWithRollover(TestDFSHoodieTestSuiteWriterAdapter.java:119)
> But was 3 times:
> -> at 
> org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50)
> -> at 
> org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50)
> -> at 
> org.apache.hudi.integ.testsuite.writer.DFSDeltaWriterAdapter.write(DFSDeltaWriterAdapter.java:50)
>     at 
> org.apache.hudi.integ.testsuite.TestDFSHoodieTestSuiteWriterAdapter.testDFSTwoFilesWriteWithRollover(TestDFSHoodieTestSuiteWriterAdapter.java:119)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>     at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
>     at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:126)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTas

[jira] [Updated] (HUDI-3818) hudi doesn't support bytes column as primary key

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3818:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> hudi doesn't support bytes column as primary key
> 
>
> Key: HUDI-3818
> URL: https://issues.apache.org/jira/browse/HUDI-3818
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: rex xiong
>Assignee: rex xiong
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
>  When a bytes column is used as the primary key, Hudi generates a fixed hoodie 
> key, so upserts will only ever insert one row. 
> {code:java}
> scala> sql("desc extended binary_test1").show()
> +--------------------+--------------------+-------+
> |            col_name|           data_type|comment|
> +--------------------+--------------------+-------+
> | _hoodie_commit_time|              string|   null|
> |_hoodie_commit_seqno|              string|   null|
> |  _hoodie_record_key|              string|   null|
> |_hoodie_partition...|              string|   null|
> |   _hoodie_file_name|              string|   null|
> |                  id|              binary|   null|
> |                name|              string|   null|
> |                  dt|              string|   null|
> |                    |                    |       |
> |# Detailed Table ...|                    |       |
> |            Database|             default|       |
> |               Table|        binary_test1|       |
> |               Owner|                root|       |
> |        Created Time|Sat Apr 02 13:28:...|       |
> |         Last Access|             UNKNOWN|       |
> |          Created By|         Spark 3.2.0|       |
> |                Type|             MANAGED|       |
> |            Provider|                hudi|       |
> |    Table Properties|[last_commit_time...|       |
> |          Statistics|        435194 bytes|       |
> +--------------------+--------------------+-------+
> scala> sql("select * from binary_test1").show()
> +---+++--+++-++
> |_hoodie_commit_time|_hoodie_commit_seqno|  
> _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|               
>    id|     name|      dt|
> +---+++--+++-++
> |  20220402132927590|20220402132927590...|id:java.nio.HeapB...|               
>        |1a06106e-5e7a-4e6...|[03 45 6A 00 00 0...|Mary Jane|20220401|
> +---+++--+++-++{code}
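>  
> The constant record key above comes from stringifying the binary value (note 
> the {{id:java.nio.HeapB...}} key), which is not value-based. A minimal sketch 
> of deriving a stable key from the byte content instead, e.g. via hex encoding; 
> the helper is illustrative and not Hudi's key generator API:
> {code:java}
> public class BytesKeySketch {
>   // Derive a deterministic, value-based record key from binary content
>   // instead of relying on ByteBuffer/byte[] toString(), which is not value-based.
>   static String toHexKey(byte[] id) {
>     StringBuilder sb = new StringBuilder(id.length * 2);
>     for (byte b : id) {
>       sb.append(String.format("%02x", b & 0xff));
>     }
>     return sb.toString();
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(toHexKey(new byte[] {0x03, 0x45, 0x6A}));  // -> "03456a"
>   }
> }
> {code}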



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3648) Failed to execute rollback due to HoodieIOException: Could not delete instant

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3648:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Failed to execute rollback due to HoodieIOException: Could not delete instant
> -
>
> Key: HUDI-3648
> URL: https://issues.apache.org/jira/browse/HUDI-3648
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.14.0
>
>
> Deltastreamer continuous mode writing to COW table with async clustering and 
> cleaning.
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_b_single_writer_async_services/b3_ds_cow_010mt_011mt_conf/test_table
>  commits 20220314165647208
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:695)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollbackFailedWrites(BaseHoodieWriteClient.java:1037)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1404)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1302)
>     at 
> org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:174)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:574)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:329)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:656)
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieIOException: Could not delete 
> instant [==>20220314165647208__commit__INFLIGHT]
>     at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.deleteInstantFile(HoodieActiveTimeline.java:250)
>     at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.deletePending(HoodieActiveTimeline.java:201)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.deleteInflightAndRequestedInstant(BaseRollbackActionExecutor.java:270)
>     at 
> org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor.executeRollback(CopyOnWriteRollbackActionExecutor.java:90)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:218)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:115)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144)
>     at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.rollback(HoodieSparkCopyOnWriteTable.java:346)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:680)
>     ... 11 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3407) Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer scenario

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3407:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer 
> scenario
> 
>
> Key: HUDI-3407
> URL: https://issues.apache.org/jira/browse/HUDI-3407
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.14.0
>
>
> Currently there's no guard-rail that would prevent Restore from running 
> concurrently with Writes in a Multi-Writer scenario, which might lead to the 
> table getting into an inconsistent state.
>  
> One of the approaches could be letting Restore acquire the Write lock for 
> the whole duration of its operation, as sketched below.
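>  
> A minimal sketch of that guard-rail with a stand-in lock interface (not Hudi's 
> actual LockProvider API); the timeout is illustrative:
> {code:java}
> import java.util.concurrent.TimeUnit;
> 
> public class RestoreUnderLockSketch {
>   interface WriteLock {                       // stand-in for the table-level write lock
>     boolean tryLock(long time, TimeUnit unit);
>     void unlock();
>   }
> 
>   static void restoreTo(String instantTime, WriteLock writeLock, Runnable restoreAction) {
>     if (!writeLock.tryLock(5, TimeUnit.MINUTES)) {
>       throw new IllegalStateException("Could not acquire write lock for restore to " + instantTime);
>     }
>     try {
>       restoreAction.run();                    // hold the lock for the whole restore
>     } finally {
>       writeLock.unlock();
>     }
>   }
> }
> {code}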



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3487) The global index is enabled regardless of changelog

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3487:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> The global index is enabled regardless of changelog
> --
>
> Key: HUDI-3487
> URL: https://issues.apache.org/jira/browse/HUDI-3487
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, index
>Reporter: waywtdcc
>Assignee: waywtdcc
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3467) Check shutdown logic with async compaction in Spark Structured Streaming

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3467:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Check shutdown logic with async compaction in Spark Structured Streaming
> 
>
> Key: HUDI-3467
> URL: https://issues.apache.org/jira/browse/HUDI-3467
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, spark
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.14.0
>
>
> Related issue
> https://github.com/apache/hudi/issues/5046



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3517:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: hudi-on-call
> Fix For: 0.14.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0&basepath=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test&fileid=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0&lastinstantts=20220225182311228&timelinehash=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool

[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3300:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, timeline-server
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.14.0
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listing records via the 
> ExternalSpillableMap. So every file listing leads to re-reading the log and 
> base files of the metadata files partition. Since the files partition is small 
> in size, even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata 
> files partition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3067) "Table already exists" error with multiple writers and dynamodb

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3067:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> "Table already exists" error with multiple writers and dynamodb
> ---
>
> Key: HUDI-3067
> URL: https://issues.apache.org/jira/browse/HUDI-3067
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Nikita Sheremet
>Assignee: Wenning Ding
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.14.0
>
>
> How to reproduce:
>  # Set up multi-writer concurrency control 
> ([https://hudi.apache.org/docs/concurrency_control/]) for DynamoDB (do not 
> forget to set _hoodie.write.lock.dynamodb.region_ and 
> {_}hoodie.write.lock.dynamodb.billing_mode{_}). Do not create any DynamoDB 
> table.
>  # Run multiple writers against the table
> (Tested on AWS EMR, so the multiple writers are EMR steps)
> Expected result - all steps completed.
> Actual result: some steps failed with exception 
> {code:java}
> Caused by: com.amazonaws.services.dynamodbv2.model.ResourceInUseException: 
> Table already exists: truedata_detections (Service: AmazonDynamoDBv2; Status 
> Code: 400; Error Code: ResourceInUseException; Request ID:; Proxy: null)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1372)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6214)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6181)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeCreateTable(AmazonDynamoDBClient.java:1160)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.createTable(AmazonDynamoDBClient.java:1124)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.createLockTableInDynamoDB(DynamoDBBasedLockProvider.java:188)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:99)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:77)
>   ... 54 more
> 21/12/19 13:42:06 INFO Yar {code}
> This happens because all steps tried to create the table at the same time.
>  
> Suggested solution:
> A catch statement for the _Table already exists_ exception should be added to 
> the dynamodb table creation code, possibly with a delay and an additional check 
> that the table is present (see the sketch below).
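>  
> A minimal sketch of that handling against the AWS SDK v1 client seen in the 
> stack trace above; the delay and the describe-based re-check are illustrative:
> {code:java}
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
> import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
> import com.amazonaws.services.dynamodbv2.model.DescribeTableRequest;
> import com.amazonaws.services.dynamodbv2.model.ResourceInUseException;
> 
> public class LockTableCreationSketch {
>   static void createLockTableIfAbsent(AmazonDynamoDB dynamoDb, CreateTableRequest request)
>       throws InterruptedException {
>     try {
>       dynamoDb.createTable(request);
>     } catch (ResourceInUseException e) {
>       // Another writer created the table concurrently; fall through and verify it exists.
>     }
>     // Re-check (with a small delay) that the table is actually there before proceeding.
>     Thread.sleep(1000L);
>     dynamoDb.describeTable(new DescribeTableRequest().withTableName(request.getTableName()));
>   }
> }
> {code}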



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1748) Read operation may fail on MOR table RT view when a write operation is running concurrently

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1748:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Read operation may fail on MOR table RT view when a write 
> operation is running concurrently
> 
>
> Key: HUDI-1748
> URL: https://issues.apache.org/jira/browse/HUDI-1748
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: lrz
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, query-eng, 
> user-support-issues
> Fix For: 0.14.0
>
>
> During a read operation, a new base file may be produced by a concurrent write 
> operation; the read can then possibly hit an NPE in getSplit. Here 
> is the exception stack:
> !https://wa.vision.huawei.com/vision-file-storage/api/file/download/upload-v2/2021/2/15/qwx352829/7bacca8042104499b0991d50b4bc3f2a/image.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3117) Kafka Connect can not clearly distinguish every task log

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3117:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Kafka Connect can not clearly distinguish every task log
> 
>
> Key: HUDI-3117
> URL: https://issues.apache.org/jira/browse/HUDI-3117
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: cdmikechen
>Assignee: Ethan Guo
>Priority: Major
>  Labels: kafka-connect
> Fix For: 0.14.0
>
>
> After creating multiple tasks in Kafka Connect, it is difficult to tell from 
> the log output which task produced a given message, because no task-related 
> information is included in the log fields.
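>  
> One common way to achieve this, sketched below with SLF4J's MDC, is to tag 
> every log line with the connector and task identity; this is an illustration, 
> not the fix actually adopted:
> {code:java}
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> import org.slf4j.MDC;
> 
> public class TaskLoggingSketch {
>   private static final Logger LOG = LoggerFactory.getLogger(TaskLoggingSketch.class);
> 
>   static void startTask(String connectorName, int taskId) {
>     // Include these keys in the log pattern (e.g. %X{connector}/%X{task}) so every
>     // line emitted by this task carries its identity.
>     MDC.put("connector", connectorName);
>     MDC.put("task", String.valueOf(taskId));
>     LOG.info("Starting Hudi sink task");
>   }
> 
>   static void stopTask() {
>     LOG.info("Stopping Hudi sink task");
>     MDC.clear();
>   }
> }
> {code}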



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3057) Instants should be generated strictly under locks

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3057:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Instants should be generated strictly under locks
> -
>
> Key: HUDI-3057
> URL: https://issues.apache.org/jira/browse/HUDI-3057
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer, writer-core
>Reporter: Alexey Kudinkin
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: sev:high
> Fix For: 0.14.0
>
> Attachments: logs.txt
>
>
> While looking into the flakiness of the tests outlined here:
> https://issues.apache.org/jira/browse/HUDI-3043
>  
> I've stumbled upon the following failure, where one of the writers tries to 
> complete the Commit but can't b/c such a file already exists:
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIOException: Failed to create file 
> /var/folders/kb/cnff55vj041g2nnlzs5ylqk0gn/T/junit5142536255031969586/testtable_MERGE_ON_READ/.hoodie/20211217150157632.commit
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:336)
>     at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamerWithMultiWriter.java:150)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>     at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
>     at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execut

[jira] [Updated] (HUDI-3023) Fix order of tests

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3023:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix order of tests
> --
>
> Key: HUDI-3023
> URL: https://issues.apache.org/jira/browse/HUDI-3023
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.14.0
>
>
> Recently, we encountered an issue in integ tests where the namenode was not 
> ready yet to receive connections (still in safemode) and hdfs commands in the 
> ITTestHoodieDemo setup were not succeeding. The namenode typically takes 
> some time (2-3 minutes) to come up. While adding a delay is a workaround, it 
> would be better to execute this test after others like 
> ITTestHoodieSyncCommand and ITTestHoodieSanity. With JUnit 5.8 we can order 
> test classes, and as we already have a working patch for upgrading junit 
> (https://github.com/apache/hudi/pull/3748) we should consider fixing the 
> order to avoid flakiness (a sketch follows).
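>  
> A minimal sketch of JUnit 5.8 class ordering, assuming 
> junit.jupiter.testclass.order.default is set to 
> org.junit.jupiter.api.ClassOrderer$OrderAnnotation in junit-platform.properties; 
> the order values are illustrative:
> {code:java}
> import org.junit.jupiter.api.Order;
> import org.junit.jupiter.api.Test;
> 
> @Order(1)
> class ITTestHoodieSanity {          // runs first while the namenode finishes starting up
>   @Test void smoke() { }
> }
> 
> @Order(2)
> class ITTestHoodieSyncCommand {
>   @Test void smoke() { }
> }
> 
> @Order(3)
> class ITTestHoodieDemo {            // runs last, once HDFS has left safemode
>   @Test void smoke() { }
> }
> {code}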



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3055) Make sure that Compression Codec configuration is respected across the board

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3055:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure that Compression Codec configuration is respected across the board
> 
>
> Key: HUDI-3055
> URL: https://issues.apache.org/jira/browse/HUDI-3055
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: storage-management
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
>  Labels: new-to-hudi
> Fix For: 0.14.0
>
>
> Currently there are quite a few places where we assume GZip as the 
> compression codec, which is incorrect given that this is configurable and 
> users might actually prefer to use a different compression codec.
> Examples:
> [HoodieParquetDataBlock|https://github.com/apache/hudi/pull/4333/files#diff-798a773c6eef4011aef2da2b2fb71c25f753500548167b610021336ef6f14807]
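>  
> A minimal sketch of resolving the codec from configuration instead of 
> hard-coding GZip; hoodie.parquet.compression.codec is the existing Parquet 
> codec key, and the surrounding wiring is simplified:
> {code:java}
> import java.util.Properties;
> import org.apache.parquet.hadoop.metadata.CompressionCodecName;
> 
> public class CodecFromConfigSketch {
>   // Resolve the codec from the write config rather than assuming GZIP everywhere.
>   static CompressionCodecName resolveCodec(Properties writeConfig) {
>     String codec = writeConfig.getProperty("hoodie.parquet.compression.codec", "gzip");
>     return CompressionCodecName.fromConf(codec);
>   }
> 
>   public static void main(String[] args) {
>     Properties props = new Properties();
>     props.setProperty("hoodie.parquet.compression.codec", "snappy");
>     System.out.println(resolveCodec(props));  // SNAPPY
>   }
> }
> {code}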



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1779:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, spark
>Reporter: lrz
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file 
> which contains a timestamp column, it will fail because of these issues:
> 1) At bootstrap, if the origin parquet file was written by a Spark 
> application, Spark will by default save the timestamp as int96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap will fail because 
> Hudi cannot read the Int96 type yet. (This can be solved by upgrading 
> parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true, see the 
> sketch after this list; please check 
> https://github.com/apache/parquet-mr/pull/831/files) 
> 2) After bootstrap, upsert will fail because we use the hoodie schema to 
> read the origin parquet file. The schemas do not match because the hoodie 
> schema treats timestamp as long while in the origin file it is Int96 
> 3) After bootstrap, a partial update of a parquet file will fail, because 
> we copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does)
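>  
> For point 1, a minimal sketch of setting the parquet-mr flag on the Hadoop 
> configuration used by the Avro/Parquet read path (assumes parquet >= 1.12.0):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> 
> public class Int96ReadSketch {
>   static Configuration int96AsFixedConf() {
>     Configuration conf = new Configuration();
>     // Let parquet-avro surface INT96 timestamps as fixed[12] instead of failing.
>     conf.setBoolean("parquet.avro.readInt96AsFixed", true);
>     return conf;
>   }
> }
> {code}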



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3114) Kafka Connect can not connect Hive by jdbc

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3114:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Kafka Connect can not connect Hive by jdbc
> --
>
> Key: HUDI-3114
> URL: https://issues.apache.org/jira/browse/HUDI-3114
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, kafka-connect
>Reporter: cdmikechen
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently Kafka Connect does not include the hive-jdbc dependency, which makes 
> it impossible to create Hive tables using Hive JDBC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2930) Rollbacks are not archived when metadata table is enabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2930:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Rollbacks are not archived when metadata table is enabled
> -
>
> Key: HUDI-2930
> URL: https://issues.apache.org/jira/browse/HUDI-2930
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.14.0
>
>
> I ran bulk inserts into a COW table using DeltaStreamer in continuous mode and 
> observed that the rollbacks are not archived.  There were commits in between 
> these old rollbacks, but after the archival process kicks in, the old 
> rollbacks are still in the active timeline while the other commits are 
> archived.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3019) Upserts with Datatype promotion only to a subset of partitions fail

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3019:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Upserts with Datatype promotion only to a subset of partitions fail
> --
>
> Key: HUDI-3019
> URL: https://issues.apache.org/jira/browse/HUDI-3019
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.0
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> Upserts with Datatype promotion applied to only a subset of partitions fail.
>  
> Let's say the initial insert was done to partition1 and partition2, with col1 
> type as integer. 
> commit2 inserted records to partition2 and partition3, with col1 type as 
> long. integer -> long is a backwards compatible evolution and hence the write 
> succeeds, but when trying to read data from Hudi we run into issues. This is 
> not seen when a new column is added. 
>  
> Reference issue: 
> [https://github.com/apache/hudi/issues/3558]
>  
> {code:java}
> spark.sql("select * from hudi_trips_snapshot2").show()
> 21/12/14 12:11:48 ERROR Executor: Exception in task 0.0 in stage 165.0 (TID 
> 1620)
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2782:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix marker based strategy for structured streaming
> --
>
> Key: HUDI-2782
> URL: https://issues.apache.org/jira/browse/HUDI-2782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are 
> making the timeline-server-based marker the default marker type. But we have an 
> issue w/ structured streaming: after the 1st micro batch, the timeline 
> server gets shut down and is not available for subsequent micro batches. So, 
> in the patch we have overridden the marker type just for 
> structured streaming. 
>  
> We may want to revisit this and see how to go about it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2910) Hudi CLI "commits showarchived" throws NPE

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2910:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Hudi CLI "commits showarchived" throws NPE
> --
>
> Key: HUDI-2910
> URL: https://issues.apache.org/jira/browse/HUDI-2910
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> When trying to show archived commits through Hudi CLI command "commits 
> showarchived", NullPointerException is thrown.  I'm using 0.10.0-rc2.
> {code:java}
> hudi:test_table->commits showarchived
> Command failed java.lang.NullPointerException
> java.lang.NullPointerException
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.lambda$readCommit$2(HoodieArchivedTimeline.java:154)
>     at org.apache.hudi.common.util.Option.map(Option.java:107)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.readCommit(HoodieArchivedTimeline.java:149)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.lambda$loadInstants$5(HoodieArchivedTimeline.java:228)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:230)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:193)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:189)
>     at 
> org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstantDetailsInMemory(HoodieArchivedTimeline.java:112)
>     at 
> org.apache.hudi.cli.commands.CommitsCommand.showArchivedCommits(CommitsCommand.java:217)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>     at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>     at 
> org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
>     at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
>     at java.lang.Thread.run(Thread.java:748) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2745) Record count does not match input after compaction is scheduled when running Hudi Kafka Connect sink

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2745:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Record count does not match input after compaction is scheduled when running 
> Hudi Kafka Connect sink
> 
>
> Key: HUDI-2745
> URL: https://issues.apache.org/jira/browse/HUDI-2745
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Spark Shell command to do snapshot query:
> {code:java}
> val basePath = "/tmp/hoodie/hudi-test-topic"
> val df = spark.read.format("hudi").load(basePath)
> df.createOrReplaceTempView("hudi_test_table")
> spark.sql("select count(*) from hudi_test_table").show() {code}
> Two cases of count mismatch:
> (1) Compaction scheduled, more deltacommits later on: the count does not 
> match the input size.  After compaction is executed, the count becomes correct.
> (2) Clustering scheduled, more deltacommits later on: the count is correct, 
> equal to the input size.  After clustering is executed, the count drops and 
> becomes incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2528) Flaky test: MERGE_ON_READ testTableOperationsWithRestore

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2528:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flaky test: MERGE_ON_READ testTableOperationsWithRestore
> 
>
> Key: HUDI-2528
> URL: https://issues.apache.org/jira/browse/HUDI-2528
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing, tests-ci
>Reporter: Raymond Xu
>Priority: Critical
> Fix For: 0.14.0
>
>
>  
> {code:java}
>  [ERROR] Failures:[ERROR] There files should have been rolled-back when 
> rolling back commit 002 but are still remaining. Files: 
> [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet,
>  
> file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet]
>  ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction 
> request available at 007 to run compaction {code}
>  
> Probably the same cause as HUDI-2108
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1889) Support partition path in a nested field in HoodieFileIndex

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1889:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support partition path in a nested field in HoodieFileIndex
> ---
>
> Key: HUDI-1889
> URL: https://issues.apache.org/jira/browse/HUDI-1889
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> Partition path in a nested field is not supported in HoodieFileIndex.  When 
> using a nested field for the partition path, the following exception is 
> thrown:
> {code:java}
> java.lang.IllegalArgumentException: Cannot find column: 'fare.currency' in 
> the 
> schema[StructField(_row_key,StringType,true),StructField(timestamp,LongType,true),StructField(name,StringType,true),StructField(fare,StructType(StructField(value,LongType,true),
>  StructField(currency,StringType,true)),true)]
>   at 
> org.apache.hudi.HoodieFileIndex$$anonfun$4$$anonfun$apply$1.apply(HoodieFileIndex.scala:98)
>   at 
> org.apache.hudi.HoodieFileIndex$$anonfun$4$$anonfun$apply$1.apply(HoodieFileIndex.scala:98)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.hudi.HoodieFileIndex$$anonfun$4.apply(HoodieFileIndex.scala:98)
>   at 
> org.apache.hudi.HoodieFileIndex$$anonfun$4.apply(HoodieFileIndex.scala:97)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties$lzycompute(HoodieFileIndex.scala:97)
>   at 
> org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties(HoodieFileIndex.scala:91)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:245)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:147)
>   at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:116)
>   at 
> org.apache.hudi.TestHoodieRowWriting.testRowWriting(TestHoodieRowWriting.scala:103)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>   at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
>   at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> 
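> For reference, a minimal write sketch that configures a nested field as the 
> partition path and hits this code path on read-back (the base path, table 
> name, and input file are hypothetical; {{fare.currency}} mirrors the schema 
> above, and an active SparkSession {{spark}} is assumed):
> {code:scala}
> import org.apache.spark.sql.SaveMode
>
> val df = spark.read.json("file:///tmp/trips.json") // has a nested struct column `fare`
>
> df.write.format("hudi").
>   option("hoodie.table.name", "trips").
>   option("hoodie.datasource.write.recordkey.field", "_row_key").
>   option("hoodie.datasource.write.precombine.field", "timestamp").
>   option("hoodie.datasource.write.partitionpath.field", "fare.currency"). // nested field
>   mode(SaveMode.Overwrite).
>   save("file:///tmp/hudi/trips")
>
> // Reading back constructs HoodieFileIndex, which is where the
> // "Cannot find column: 'fare.currency'" IllegalArgumentException above surfaces.
> spark.read.format("hudi").load("file:///tmp/hudi/trips").count()
> {code}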

[jira] [Updated] (HUDI-1380) Async cleaning does not work with Timeline Server

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1380:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Async cleaning does not work with Timeline Server
> -
>
> Key: HUDI-1380
> URL: https://issues.apache.org/jira/browse/HUDI-1380
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, table-service, timeline-server
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1369:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Bootstrap datasource jobs from hanging via spark-submit
> ---
>
> Key: HUDI-1369
> URL: https://issues.apache.org/jira/browse/HUDI-1369
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> MOR table creation via Hudi datasource hangs at the end of the spark-submit 
> job.
> Looks like the {{HoodieWriteClient}} created at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  is not being closed, which means the timeline server is not stopped at the 
> end; as a result, the job hangs and never exits.
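> A sketch of the fix pattern only (not the actual HoodieSparkSqlWriter change; 
> the package locations follow the 0.6-era module layout and the helper/payload 
> names are illustrative): whichever path creates the client must also close 
> it, so the embedded timeline server is stopped and the process can exit.
> {code:scala}
> import org.apache.hudi.client.HoodieWriteClient
> import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
>
> def withWriteClient[A](client: HoodieWriteClient[OverwriteWithLatestAvroPayload])
>                       (body: HoodieWriteClient[OverwriteWithLatestAvroPayload] => A): A = {
>   try body(client)
>   finally client.close() // stops the embedded timeline server; skipping this is
>                          // what leaves the spark-submit job hanging
> }
> {code}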



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1117:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, meta-sync
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.14.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> The JSONException class comes from 
> https://mvnrepository.com/artifact/org.json/json. There is a licensing issue, 
> and hence it is not part of the Hudi bundle packages. The underlying issue is 
> due to Hive 1.x vs 2.x (see 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark's Hive integration still brings in Hive 1.x jars, which depend on 
> org.json. I believe this jar was provided in the user's environment, and 
> hence we have not seen folks complaining about this issue.
> Even though this is not a Hudi issue per se, let me check a jar with a 
> compatible license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 
> and, if it works, we will add it to the 0.6 bundles after discussing with the 
> community.
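> Until the bundles carry a compatible jar, a hedged workaround sketch from the 
> application side, assuming the tdunning jar is indeed a drop-in replacement 
> (the local path below is hypothetical; the coordinates are the 
> com.tdunning:json:1.8 ones mentioned above):
> {code:scala}
> // Make org.json's JSONException visible to the Hive 1.x classes at runtime by
> // shipping the ASL-licensed json jar with the Spark application.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("hudi-hive-sync")
>   .config("spark.jars", "/path/to/json-1.8.jar") // or: spark-submit --jars /path/to/json-1.8.jar
>   .enableHiveSupport()
>   .getOrCreate()
> {code}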



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1158) Optimizations in parallelized listing behaviour for markers and bootstrap source files

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1158:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Optimizations in parallelized listing behaviour for markers and bootstrap 
> source files
> --
>
> Key: HUDI-1158
> URL: https://issues.apache.org/jira/browse/HUDI-1158
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> * Extract out the common inner logic
>  * Parallelize not just at the top directory level, but down at the level of 
> the leaf partition folders



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1036:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.14.0
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735]. When hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, Hive is not able to 
> create a HoodieRealtimeFileSplit for querying the _rt table. Please see the 
> GitHub issue for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1145) Debug if Insert operation calls upsert in case of small file handling path.

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1145:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Debug if Insert operation calls upsert in case of small file handling path.
> ---
>
> Key: HUDI-1145
> URL: https://issues.apache.org/jira/browse/HUDI-1145
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.14.0
>
>
> INSERT operations may be triggering UPSERT internally in the merge process 
> when dealing with small files. This surfaced out of a Slack thread. Need to 
> confirm whether this is indeed happening. If yes, this needs to be fixed such 
> that the merge handle does not call upsert and instead lets conflicting 
> records into the file if it is an INSERT operation.
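> A minimal repro sketch for the check (standard config keys; the table name, 
> base path, and the pre-existing DataFrame {{df}} are hypothetical):
> {code:scala}
> // Issue an explicit INSERT into a table whose existing file groups are below the
> // small-file limit, then inspect whether those small-file bins are routed through
> // the merge/upsert handle.
> import org.apache.spark.sql.SaveMode
>
> df.write.format("hudi").
>   option("hoodie.table.name", "small_file_check").
>   option("hoodie.datasource.write.operation", "insert").  // explicitly INSERT, not upsert
>   option("hoodie.datasource.write.recordkey.field", "_row_key").
>   option("hoodie.datasource.write.precombine.field", "timestamp").
>   option("hoodie.parquet.small.file.limit", "104857600"). // 100 MB: small files absorb new inserts
>   mode(SaveMode.Append).
>   save("/tmp/hudi/small_file_check")
> {code}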



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1286) Merge On Read queries (_rt) fails on docker demo for test suite

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1286:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Merge On Read queries (_rt) fails on docker demo for test suite
> ---
>
> Key: HUDI-1286
> URL: https://issues.apache.org/jira/browse/HUDI-1286
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dev-experience, Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.14.0
>
>
> When running the following query -> 
> {code:java}
> select count(*) from testdb.table1_rt
> {code}
> we see the following exception in hiveserver:
> {code:java}
> 2020-09-16T03:41:07,668 INFO  LocalJobRunner Map Task Executor #0: 
> realtime.AbstractRealtimeRecordReader 
> (AbstractRealtimeRecordReader.java:init(88)) - Writer Schema From Parquet => 
> [_hoodie_commit_time type:UNION pos:0, _hoodie_commit_seqno type:UNION pos:1, 
> _hoodie_record_key type:UNION pos:2, _hoodie_partition_path type:UNION pos:3, 
> _hoodie_file_name type:UNION pos:4, timestamp type:LONG pos:5, _row_key 
> type:STRING pos:6, rider type:STRING pos:7, driver type:STRING pos:8, 
> begin_lat type:DOUBLE pos:9, begin_lon type:DOUBLE pos:10, end_lat 
> type:DOUBLE pos:11, end_lon type:DOUBLE pos:12, fare type:DOUBLE 
> pos:13]2020-09-16T03:41:07,668 INFO  LocalJobRunner Map Task Executor #0: 
> realtime.AbstractRealtimeRecordReader 
> (AbstractRealtimeRecordReader.java:init(88)) - Writer Schema From Parquet => 
> [_hoodie_commit_time type:UNION pos:0, _hoodie_commit_seqno type:UNION pos:1, 
> _hoodie_record_key type:UNION pos:2, _hoodie_partition_path type:UNION pos:3, 
> _hoodie_file_name type:UNION pos:4, timestamp type:LONG pos:5, _row_key 
> type:STRING pos:6, rider type:STRING pos:7, driver type:STRING pos:8, 
> begin_lat type:DOUBLE pos:9, begin_lon type:DOUBLE pos:10, end_lat 
> type:DOUBLE pos:11, end_lon type:DOUBLE pos:12, fare type:DOUBLE 
> pos:13]2020-09-16T03:41:07,670 INFO  [Thread-465]: mapred.LocalJobRunner 
> (LocalJobRunner.java:runTasks(483)) - map task executor 
> complete.2020-09-16T03:41:07,671 WARN  [Thread-465]: mapred.LocalJobRunner 
> (LocalJobRunner.java:run(587)) - job_local242522391_0010java.lang.Exception: 
> java.io.IOException: org.apache.hudi.exception.HoodieException: Error 
> ordering fields for storage read. #fieldNames: 4, #fieldPositions: 5 at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) 
> ~[hadoop-mapreduce-client-common-2.8.4.jar:?] at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) 
> ~[hadoop-mapreduce-client-common-2.8.4.jar:?]Caused by: java.io.IOException: 
> org.apache.hudi.exception.HoodieException: Error ordering fields for storage 
> read. #fieldNames: 4, #fieldPositions: 5 at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
>  ~[hive-exec-2.3.3.jar:2.3.3] at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
>  ~[hive-exec-2.3.3.jar:2.3.3] at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
>  ~[hive-exec-2.3.3.jar:2.3.3] at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) 
> ~[hadoop-mapreduce-client-core-2.8.4.jar:?] at 
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) 
> ~[hadoop-mapreduce-client-core-2.8.4.jar:?] at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) 
> ~[hadoop-mapreduce-client-core-2.8.4.jar:?] at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
>  ~[hadoop-mapreduce-client-common-2.8.4.jar:?] at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[?:1.8.0_212] at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_212] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]Caused 
> by: org.apache.hudi.exception.HoodieException: Error ordering fields for 
> storage read. #fieldNames: 4, #fieldPositions: 5 at 
> org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.orderFields(HoodieRealtimeRecordReaderUtils.java:258)
>  ~[hoodie-hadoop-mr-bundle.jar:0.6.1-SNAPSHOT] at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:99)
>  ~[hoodie-hadoop-mr-bundle.jar:0.6.1-SNAPSHOT] at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(Abstra

[jira] [Updated] (HUDI-234) Graceful degradation of ObjectSizeCalculator for non hotspot jvms

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-234:
---
Fix Version/s: 0.14.0
   (was: 0.13.1)

> Graceful degradation of ObjectSizeCalculator for non hotspot jvms
> -
>
> Key: HUDI-234
> URL: https://issues.apache.org/jira/browse/HUDI-234
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: new-to-hudi
> Fix For: 0.14.0
>
>
> https://github.com/apache/incubator-hudi/issues/860 bug report 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-992:
---
Fix Version/s: 0.14.0
   (was: 0.13.1)

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap, meta-sync
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Currently the bootstrap implementation is not able to handle partition 
> columns correctly when the source data has *hive-style partitioning*, as is 
> also mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not include the partition column schema (in the case of 
> hive-partitioned data). As a result, during hive-sync, when Hudi tries to 
> determine the type of the partition column from that schema, it does not 
> find it and assumes the default data type *string*.
> Here is where the partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus, no matter what the data type of the partition column is in the source 
> data (at least what Spark infers it as from the path), it will always be 
> synced as string.
>  
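> To make the mismatch concrete, a small illustration (hypothetical path and 
> column, assuming an active SparkSession {{spark}}):
> {code:scala}
> import org.apache.spark.sql.functions.lit
>
> // Hive-style partitioned source layout: /tmp/source_hive_style/year=2020/...
> spark.range(0, 10)
>   .withColumn("year", lit(2020))
>   .write.partitionBy("year")
>   .parquet("/tmp/source_hive_style")
>
> // Spark infers the partition column type back from the path:
> spark.read.parquet("/tmp/source_hive_style").schema("year").dataType
> // -> IntegerType
>
> // After bootstrap + hive sync, the commit-metadata schema has no entry for `year`,
> // so HiveSchemaUtil falls back to the default and the Hive table shows `year` as string.
> {code}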



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-83:
--
Fix Version/s: 0.14.0
   (was: 0.13.1)

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync, Usability
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Critical
>  Labels: pull-request-available, query-eng, sev:critical, 
> user-support-issues
> Fix For: 0.14.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

