Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-12 Thread via GitHub


danny0405 commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2226621918

   Yeah, the MDT delta_commit is archived based on its own strategy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-12 Thread via GitHub


Jason-liujc commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2226519310

   Update:
   We tried cleaning the table synchronously instead of asynchronously, and we can see 
the compaction commit after the second run. It seems the first run fixed a lot of 
pending commits we had in the table:
   ```
   Obtaining marker files for all created, merged paths 
   
   Perform rollback actions: componentoutputs_discretionarycoop
   ```
   
   and the second run ran:
   ```
   Preparing compaction metadata: componentoutputs_discretionarycoop_metadata
   ```
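
   For anyone else trying the same switch, here is a minimal sketch of the change we made 
(option names per Hudi's cleaning configs; the table name, base path, and `spark_df` 
DataFrame are placeholders for our actual job):
   ```
   # Sketch: run cleaning synchronously as part of the write commit instead of async.
   # spark_df is assumed to be an existing DataFrame with the table's schema.
   hudi_options = {
       "hoodie.table.name": "my_hudi_table",    # placeholder
       "hoodie.clean.automatic": "true",        # clean inline with each commit
       "hoodie.clean.async": "false",           # previously "true" when cleaning ran asynchronously
   }

   (spark_df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://my-bucket/path/to/table"))   # placeholder base path
   ```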


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-2221758340

   > Could not read commit details from 
hdfs://hacluster/user/kylin/flink/data/streaming_rdss_rcsp_lab/2024062815382133
   
   Is this a real file on storage? Did you check the integrity of it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]

2024-07-10 Thread via GitHub


fanfanAlice commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-2219857140

   yes
   set hoodie.embed.timeline.server=false
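
   For example, if you are writing from Spark, a minimal sketch of where that flag goes 
(mitigation only; the table name, base path, and `df` below are placeholders):
   ```
   # Sketch: disabling the embedded timeline server on a Hudi write (PySpark).
   # df is assumed to be an existing DataFrame with the table's schema.
   (df.write.format("hudi")
       .option("hoodie.table.name", "my_hudi_table")        # placeholder table name
       .option("hoodie.embed.timeline.server", "false")     # the flag mentioned above
       .mode("append")
       .save("hdfs://namenode/path/to/table"))              # placeholder base path
   ```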


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] hudi-common 0.14.0 jar in mavenCentral appears to have corrupt generated avro classes [hudi]

2024-07-09 Thread via GitHub


lucasmo commented on issue #11602:
URL: https://github.com/apache/hudi/issues/11602#issuecomment-2218542973

   https://github.com/apache/hudi/issues/11378 appears to be caused by this 
same issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] hudi-common 0.14.0 jar in mavenCentral appears to have corrupt generated avro classes [hudi]

2024-07-09 Thread via GitHub


lucasmo opened a new issue, #11602:
URL: https://github.com/apache/hudi/issues/11602

   **Describe the problem you faced**
   
   When diagnosing a problem with XTable (see 
https://github.com/apache/incubator-xtable/issues/466), I noticed that avro 
classes were unable to even be instantiated for schema in a very simple test 
case when using `hudi-common-0.14.0` as a dependency. 
   
   However, this issue does not exist when using 
`hudi-spark3.4-bundle_2.12-0.14.0` as a dependency, which contains the same 
avro autogenerated classes. A good specific example is 
`org/apache/hudi/avro/model/HoodieCleanPartitionMetadata.class`.
   
   When compiling hudi locally (tag `release-0.14.0`, `mvn clean package 
-DskipTests -Dspark3.4`, java 1.8), both generated jar files have the correct 
implementations of avro autogenerated classes.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Download and uncompress hudi-spark3.4-bundle_2.12-0.14.0.jar and 
hudi-common-0.14.0.jar from mavencentral
   2. Build Hudi locally
   3. Run javap on 
`org/apache/hudi/avro/model/HoodieCleanPartitionMetadata.class` in all four of 
the jars
   4. Note the file size of the text output of javap is 4232 for the file from 
every single jar aside from hudi-common, which has a javap text file size of 
2323.
   
   OR
   
   run the following in Java 11, replacing $PATH_TO_A_HOODIE_AVRO_MODELS_JAR 
with a path to one of the four jar files
   ```
   jshell --class-path 
~/.m2/repository/org/apache/avro/avro/1.11.3/avro-1.11.3.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.17.1/jackson-databind-2.17.1.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.17.1/jackson-annotations-2.17.1.jar:~/.m2/repository/org/slf4j/slf4j-api/2.0.9/slf4j-api-2.0.9.jar:$PATH_TO_A_HOODIE_AVRO_MODELS_JAR
   ```
   
   Then, copy and paste this into the shell:
   ```
   org.apache.avro.Schema schema = new 
org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"HoodieCleanPartitionMetadata\",\"namespace\":\"org.apache.hudi.avro.model\",\"fields\":[{\"name\":\"partitionPath\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"policy\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"deletePathPatterns\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"successDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"failedDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"isPartitionDeleted\",\"type\":[\"null\",\"boolean\"],\"default\":null}]}");
 System.out.println("Class for schema: " + 
org.apache.avro.specific.SpecificData.get().getClass(schema));
   ```
   
   On the MavenCentral hudi-common-0.14.0 jar, you should get:
   ```
   |  Exception java.lang.ExceptionInInitializerError
   |at Class.forName0 (Native Method)
   |at Class.forName (Class.java:398)
   ...
   |  Caused by: java.lang.IllegalStateException: Recursive update
   |at ConcurrentHashMap.computeIfAbsent (ConcurrentHashMap.java:1760)
   ```
   
   
   **Expected behavior**
   
   The above code snippet prints
   ```
   Class for schema: class 
org.apache.hudi.avro.model.HoodieCleanPartitionMetadata
   ```
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   everything else n/a, but duplicated issue on macOS and Ubuntu 22.04.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]

2024-07-09 Thread via GitHub


danny0405 commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-2217396181

   > the dataset: [=-)
   Embedded timeline server is disabled
   
   Did you disable the embedded timeline server?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-08 Thread via GitHub


danny0405 commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2215855839

   @Jason-liujc Thanks for these tries, but at a high level we should definitely 
simplify the design of the MDT. At least from 1.x onward, MDT compaction can work 
smoothly with any async table service; the next step is to make it fully non-blocking 
(NB-CC).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-08 Thread via GitHub


Jason-liujc commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2214835602

   Had an offline discussion with Shiyan.

   As long as the metadata table is not compacted properly, insert performance will 
gradually get worse and worse.

   Here are some action items we are taking:

   1. For future Hudi issues, we'll try to create GitHub issues first. I'll 
create another one for some incremental query errors (but it's totally mitigable 
on our end)
   2. For this specific issue on metadata table not being compacted, we’ll try 
the following
   a. Run scripts to delete previous uncommitted instants (and any files 
created if any) and see if the metadata compaction resumes
   b. Run workload with synchronous cleaning to see if it can compact the 
metadata table
   c. After cleaning up pending commits, see if we successfully reinitialize 
the metadata table


   Will give an update here on how it goes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-05 Thread via GitHub


xushiyan commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2211178125

   > run compaction of the metadata table asynchronously
   
   There is no option to do that, as MT compaction is managed internally.
   
   > `hoodie.metadata.max.deltacommits.when_pending` parameter to say like 100
   
   @Jason-liujc this is only a mitigation strategy. To get the MT to compact, you 
need to resolve the pending commit (let it finish or roll it back) on the data table's 
timeline. If you email us the zipped `.hoodie/`, we can help analyze it.
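
   For reference, one way to inspect and roll back a pending instant from Spark is via 
Hudi's SQL call procedures. This is only a sketch, assuming Hudi 0.11+ with the Hudi 
Spark session extension enabled; the table name and instant time are placeholders, and 
the exact procedure names and semantics should be verified against the procedures docs 
for your version:
   ```
   # Sketch: inspect the data table's timeline, then roll back a stuck instant (PySpark).
   from pyspark.sql import SparkSession

   spark = (SparkSession.builder
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
       .getOrCreate())

   # List recent commits to find the pending instant on the data table.
   spark.sql("call show_commits(table => 'my_hudi_table', limit => 20)").show(truncate=False)

   # Roll back the pending instant that is blocking MDT compaction (placeholder instant time).
   spark.sql("call rollback_to_instant(table => 'my_hudi_table', instant_time => '20240531082713041')").show()
   ```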
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-04 Thread via GitHub


danny0405 commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2208403502

   Did you check whether the data table has a long-pending instant that never 
finished?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi 0.12.1 support for Spark Structured Streaming. read clustering metadata replace avro file error. Unrecognized token 'Obj^A^B^Vavro' [hudi]

2024-07-03 Thread via GitHub


sdudi-te commented on issue #7375:
URL: https://github.com/apache/hudi/issues/7375#issuecomment-2205762639

   Is there a possible workaround for this? In other words, how do we recover 
from this situation?
   
   We are using Spark Structured Streaming on Kafka and writing the output to Hudi on 
S3.
   
   After deleting the partial commit file (as a workaround), we observe that even 
though the streaming job is progressing with updated offsets, no data is ever 
written to Hudi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-02 Thread via GitHub


Jason-liujc commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2204845455

   @danny0405 Ahh gotcha, we do have an async cleaner that runs for our Hudi 
tables. 
   
   @ad1happy2go  I don't see any compaction on the metadata table since a given 
date (I believe that's when we moved Hudi cleaning from sync to async, based on 
Danny's comment). When I delete the metadata table and try to reinitialize it, I see 
this error, which I believe lists the blocking instants:
   
   ```
   24/06/15 01:06:20 ip-10-0-157-87 WARN HoodieBackedTableMetadataWriter: 
Cannot initialize metadata table as operation(s) are in progress on the 
dataset: [[==>20240523221631416__commit__INFLIGHT__20240523224939000], 
[==>20240523225648799__commit__INFLIGHT__20240523232254000], 
[==>20240524111304660__commit__INFLIGHT__20240524142426000], 
[==>20240524235127638__commit__INFLIGHT__2024052500064], 
[==>20240525005114829__commit__INFLIGHT__20240525011802000], 
[==>20240525065356540__commit__INFLIGHT__20240525071004000], 
[==>20240525170219523__commit__INFLIGHT__20240525192315000], 
[==>20240527184608604__commit__INFLIGHT__20240527190327000], 
[==>20240528190417601__commit__INFLIGHT__20240528192418000], 
[==>20240529054718316__commit__INFLIGHT__20240529060542000], 
[==>20240530125710177__commit__INFLIGHT__20240531081522000], 
[==>20240530234238360__commit__INFLIGHT__20240530234726000], 
[==>20240531082713041__commit__REQUESTED__20240531082715000], 
[==>20240601164223688__commit__INFLIGHT__2024060
 1190853000], [==>20240602072248313__commit__INFLIGHT__20240603005951000], 
[==>20240603010859993__commit__INFLIGHT__20240603100305000], 
[==>20240604043334594__commit__INFLIGHT__20240604061732000], 
[==>20240605061406367__commit__REQUESTED__20240605061412000], 
[==>20240605063936872__commit__REQUESTED__20240605063943000], 
[==>20240605071904045__commit__REQUESTED__2024060507191], 
[==>20240605074456040__commit__REQUESTED__20240605074502000], 
[==>20240605082437667__commit__REQUESTED__20240605082443000], 
[==>20240605085008272__commit__REQUESTED__20240605085014000], 
[==>20240605123632368__commit__REQUESTED__20240605123638000], 
[==>20240605130201503__commit__REQUESTED__20240605130207000], 
[==>20240605134213113__commit__REQUESTED__20240605134219000], 
[==>20240605140741158__commit__REQUESTED__20240605140747000], 
[==>20240605144756228__commit__REQUESTED__20240605144802000], 
[==>20240605151313557__commit__REQUESTED__20240605151319000], 
[==>20240605195405678__commit__REQUESTED__202406051954110
 00], [==>20240605202017653__commit__REQUESTED__20240605202023000], 
[==>20240605205949232__commit__REQUESTED__20240605205955000], 
[==>20240605212536568__commit__REQUESTED__20240605212542000], 
[==>20240605220432089__commit__REQUESTED__20240605220438000], 
[==>20240606152537217__commit__INFLIGHT__20240607031027000], 
[==>20240606181110800__commit__INFLIGHT__2024060843000], 
[==>20240607112530977__commit__INFLIGHT__20240607212013000], 
[==>20240607213124841__commit__INFLIGHT__20240609024214000], 
[==>20240608001245366__commit__INFLIGHT__2024060904553], 
[==>20240609030620894__commit__INFLIGHT__2024060918031], 
[==>20240609181330488__commit__REQUESTED__20240609181336000], 
[==>20240609194304829__commit__INFLIGHT__20240611095337000], 
[==>20240611003906613__commit__INFLIGHT__20240611014341000], 
[==>20240611100258837__commit__INFLIGHT__20240612075536000], 
[==>20240611174425406__commit__INFLIGHT__20240611184626000], 
[==>20240612081821910__commit__INFLIGHT__20240612102427000], [==>2024061
 2204659323__commit__REQUESTED__20240612204705000], 
[==>20240613044301243__commit__INFLIGHT__20240613075101000], 
[==>20240613085334404__commit__INFLIGHT__20240613105718000], 
[==>20240613113055212__commit__REQUESTED__20240613113101000], 
[==>20240613122745696__commit__REQUESTED__20240613122751000], 
[==>20240614094542418__commit__REQUESTED__20240614094548000], 
[==>20240614172456990__commit__REQUESTED__20240614172503000], 
[==>20240614175526954__commit__REQUESTED__20240614175529000], 
[==>20240614181441857__commit__REQUESTED__20240614181444000], 
[==>20240614222012190__commit__REQUESTED__20240614222015000], 
[==>20240614225952031__commit__REQUESTED__20240614225954000], 
[==>20240614235545094__commit__REQUESTED__20240614235547000]]
   ```
   
   I guess my next questions are:
   
   1. Is there a way to run compaction of the metadata table asynchronously, 
without cleaning up commits, deleting the metadata table, and recreating it again? 
That process is a bit expensive and, based on what Danny said, metadata table 
compaction still won't work going forward. 
   
   2. Also, if we just increase the `hoodie.metadata.max.deltacommits.when_pending` 
parameter to, say, 100, what type of performance hit would we expect to take? Is 
it mostly at the S3 file-listing level? 
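   
   (For completeness, the mitigation in question 2 would just be one more write option; 
a sketch is below, where the option name comes from this thread and everything else is 
a placeholder:)
   ```
   # Sketch: raising the MDT delta-commit cap while a pending instant blocks compaction.
   # Mitigation only, per the discussion above; spark_df is an existing DataFrame.
   hudi_options = {
       "hoodie.table.name": "my_hudi_table",                      # placeholder
       "hoodie.metadata.enable": "true",
       "hoodie.metadata.max.deltacommits.when_pending": "100",    # value discussed above
   }

   (spark_df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://my-bucket/path/to/table"))                     # placeholder base path
   ```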
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2197789686

   This is a known issue, probably because you have enabled an async table 
service on the data table. The 0.x Hudi metadata table does not work with any async 
table service, which causes the MDT-not-compacting issue. It is fixed on master now, 
with our new completion-time-based file slicing and non-blocking-style concurrency 
control.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi insert job failed due to multiple files belonging to the same bucket id [hudi]

2024-07-02 Thread via GitHub


beyond1920 commented on issue #11527:
URL: https://github.com/apache/hudi/issues/11527#issuecomment-2197044665

   @danny0405 @dongtingting 
   Good point. I think your analysis is reasonable.
   Generating the file id in the driver could avoid different file group ids for the 
same bucket id, but it might cost too much memory in some cases.
   
   + @xushiyan WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-06-24 Thread via GitHub


Limess commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185835901

   > @codope: As stated in the issue, the problem occurs consistently. The version we 
are currently using is 0.14. @Limess: Have you not encountered this problem again? 
May I ask how it was avoided? Thanks!
   
   We never pursued this and are still on 0.13.0 for now, so I can't verify 
either way, sorry!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-06-23 Thread via GitHub


codope commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185495368

   @zhaobangcai The full context is that the issue was fixed, but in order to 
fix it, the archived timeline also had to be read. This caused too high a sync 
latency, so the fix was reverted. Generally, reading the archived timeline is 
an anti-pattern in Hudi, and we are optimizing this by implementing the LSM timeline 
in 1.0.0. That said, I think we did fix the timeline loading in 
https://github.com/apache/hudi/commit/ab61f61df9686793406300c0018924a119b02855, 
which I believe is in 0.14. Can you please share a script/test case to 
reproduce the issue with all the configs that you used in your env? I am going to 
reopen the issue based on your comment and debug further once you provide the 
script/test case. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-06-23 Thread via GitHub


Limess opened a new issue, #8114:
URL: https://github.com/apache/hudi/issues/8114

   **Describe the problem you faced**
   
   After running an insert to overwrite a Hudi table inplace using 
`insert_overwrite_table`, partitions which no longer exist in the new input 
data are not removed by the Hive Sync. This causes some query engines to fail 
until the old partitions are manually removed (e.g. AWS Athena).
   
   This is on Hudi 0.12.1, but I'm fairly sure this issue still exists on 
0.13.0 - this change: https://github.com/apache/hudi/pull/6662 fixes this 
behaviour for `delete_partition` operations, but doesn't add any handling for 
`insert_overwrite_table`. 
   
   I'd be happy to be proven otherwise if this is fixed in 0.13.0 - I don't 
have an environment to easily test this without working out how to upgrade on 
EMR without a release.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a new Hudi table using input data with two partitions, e.g. 
partition_col=1, partition_col=2
   2. Insert into the table using the operation 
`hoodie.datasource.write.operation=insert_overwrite_table` with input data 
containing 1/2 of the original partitions, e.g. only partition_col=2
   3. Run HiveSyncTool or similar (doesn't work with Spark writer sync or 
HiveSyncTool)
   4. Check the Hive partitions. Both partitions still exist
   
   **Expected behavior**
   
   I'd expect the partition which was not inserted to be removed, e.g. only 
partition_col=2 exists, partition_col=1 is deleted.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.1
   
   * Hive version : AWS Glue
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Running on EMR 0.6.9
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-06-23 Thread via GitHub


zhaobangcai commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185478617

   @codope: As stated in the issue, the problem occurs consistently. The 
version we are currently using is 0.14.
   @Limess: Have you not encountered this problem again? May I ask how it 
was avoided? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-06-23 Thread via GitHub


zhaobangcai commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185411734

   @codope: Hello, this issue still exists in version 0.14. Why was it closed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] [hudi]

2024-06-18 Thread via GitHub


ashwinagalcha-ps opened a new issue, #11468:
URL: https://github.com/apache/hudi/issues/11468

   When using Kafka + Debezium + Streamer, we are able to write data and the 
job works fine, but when using the SqlQueryBasedTransformer, it writes data to 
S3 with the new field but the job ultimately fails.
   
   Below are the Hudi Deltastreamer job configs:
   
   ```"--table-type", "COPY_ON_WRITE",  
   "--source-class", 
"org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource",
   "--transformer-class", 
"org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
   "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT *, extract(year 
from a.created_at) as year FROM  a",
   "--source-ordering-field", output["source_ordering_field"], 
   "--target-base-path", 
f"s3a://{env_params['deltastreamer_bucket']}/{db_name}/{schema}/{output['table_name']}/",
  
   "--target-table", output["table_name"],  
   "--auto.offset.reset=earliest
   "--props", properties_file,  
   "--payload-class", 
"org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload",
   "--enable-hive-sync",  
   "--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms",
   "--hoodie-conf", 
"hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true",
   "--hoodie-conf", 
f"hoodie.deltastreamer.source.kafka.topic={connector_name}.{schema}.{output['table_name']}",
   "--hoodie-conf", f"schema.registry.url={env_params['schema_registry_url']}",
   "--hoodie-conf", 
f"hoodie.deltastreamer.schemaprovider.registry.url={env_params['schema_registry_url']}/subjects/{connector_name}.{schema}.{output['table_name']}-value/versions/latest",
   "--hoodie-conf", 
"hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer",
   "--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false",  
   "--hoodie-conf", 
f"hoodie.datasource.hive_sync.database={output['hive_database']}",  
   "--hoodie-conf", 
f"hoodie.datasource.hive_sync.table={output['table_name']}",  
   "--hoodie-conf", "hoodie.datasource.hive_sync.metastore.uris=", 
   "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",  
   "--hoodie-conf", "hoodie.datasource.hive_sync.support_timestamp=true", 
   "--hoodie-conf", "hoodie.deltastreamer.source.kafka.maxEvents=10",
   "--hoodie-conf", 
f"hoodie.datasource.write.recordkey.field={output['record_key']}", 
   "--hoodie-conf", 
f"hoodie.datasource.write.precombine.field={output['precombine_field']}",
   "--hoodie-conf", 
f"hoodie.datasource.hive_sync.glue_database={output['hive_database']}",
   "--continuous"```
   
   Properties file:
   ```bootstrap.servers=
   auto.offset.reset=earliest
   schema.registry.url=http://host:8081```
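   
   For comparison, the Hudi docs show the transformer SQL referencing the incoming batch 
through the `<SRC>` placeholder rather than a bare table name, e.g. (column name kept 
from the config above, everything else unchanged):
   ```
   "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT a.*, extract(year from a.created_at) as year FROM <SRC> a",
   ```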
   
   **Expected behavior**: To be able to extract a new field (year) in the 
target hudi table with the help of SqlQueryBasedTransformer.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   * Base image & jars:
   
`public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21`
   
   
`https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar
   
https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.0/hudi-utilities-bundle_2.12-0.14.0.jar`
   
   **Stacktrace**
   
   ```2024-06-14T14:16:17.562738557Z 24/06/14 14:16:17 ERROR HoodieStreamer: 
Shutting down delta-sync due to exception
   2024-06-14T14:16:17.562785897Z 
org.apache.hudi.utilities.exception.HoodieTransformExecutionException: Failed 
to apply sql query based transformer
   2024-06-14T14:16:17.562797467Z   at 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:68)
   2024-06-14T14:16:17.562805097Z   at 
org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105)
   2024-06-14T14:16:17.562812197Z   at 
org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530)
   2024-06-14T14:16:17.562819517Z   at 
org.apache.hudi.common.util.Option.map(Option.java:108)
   2024-06-14T14:16:17.562826327Z   at 
org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530)
   2024-06-14T14:16:17.562836838Z   at 
org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495)
   2024-06-14T14:16:17.562844648Z   at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405)
   2024-06-14T14:16:17.562852958Z   at 
org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757)
   2024-06-14T14:16:17.562860358Z   at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
   2024-06-14T14:16:17.562868059Z   at 

Re: [I] [SUPPORT] [hudi]

2024-06-13 Thread via GitHub


codope closed issue #11431: [SUPPORT]
URL: https://github.com/apache/hudi/issues/11431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-13 Thread via GitHub


ad1happy2go commented on issue #11431:
URL: https://github.com/apache/hudi/issues/11431#issuecomment-2165415200

   @zaminhassnain06 Closing this issue. Please reopen it or create a new one for any 
further questions on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-12 Thread via GitHub


zaminhassnain06 commented on issue #11431:
URL: https://github.com/apache/hudi/issues/11431#issuecomment-2163145963

   Thanks @ad1happy2go 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-12 Thread via GitHub


ad1happy2go commented on issue #11431:
URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162943432

   @zaminhassnain06 Correct, you need to rebuild.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-12 Thread via GitHub


zaminhassnain06 commented on issue #11431:
URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162768973

   @ad1happy2go  Yes, the data types of our columns are changing, mostly from int to 
bigint, as our data grows. So in this scenario we should move directly to the higher 
version and rebuild our tables on the higher version, is that correct?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-12 Thread via GitHub


ad1happy2go commented on issue #11431:
URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162750491

   @zaminhassnain06 Did the data type of your id column change? Why do you need to 
run the ALTER command?
We can't change an integer field to long, as that's not backward compatible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Application getting stuck when Async cleaner is spawned [hudi]

2024-06-11 Thread via GitHub


JuanAmayaBT commented on issue #7364:
URL: https://github.com/apache/hudi/issues/7364#issuecomment-2161690376

   Any news on this? I am using **Hudi 0.14.1 on AWS Glue** and getting the 
following error from time to time, which seems to be related to this issue:
   `Error waiting for async clean service to finish`
   ```
   
spark_df.write.format('hudi').options(**hudi_final_settings).mode('Append').save()
 File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
line 966, in save
   self._jwrite.save()
 File 
"/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 
1321, in __call__
   return_value = get_return_value(
 File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
190, in deco
   return f(*a, **kw)
 File 
"/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 
326, in get_return_value
   raise Py4JJavaError(
   py4j.protocol.Py4JJavaError: An error occurred while calling o169.save.
   : org.apache.hudi.exception.HoodieException: Error waiting for async clean 
service to finish
at 
org.apache.hudi.async.AsyncCleanerService.waitForCompletion(AsyncCleanerService.java:77)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.asyncClean(BaseHoodieTableServiceClient.java:133)
at 
org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:595)
at 
org.apache.hudi.client.BaseHoodieWriteClient.mayBeCleanAndArchive(BaseHoodieWriteClient.java:579)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:248)
at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:104)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1081)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:520)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:103)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:100)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:96)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:615)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:615)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 

[I] [SUPPORT] [hudi]

2024-06-11 Thread via GitHub


zaminhassnain06 opened a new issue, #11431:
URL: https://github.com/apache/hudi/issues/11431

   Hi,
   I tried to update my Hudi version from 0.6.0 to 0.11. I updated it gradually, 
version by version, starting from 0.6 to 0.9, then from 0.9 to 0.10, and finally 
from 0.10 to 0.11. I am running it on EMR and querying the Hudi table in Athena. 
The table version was updated correctly after each update in the hoodie.properties 
file on S3. However, when I tried to run an ALTER TABLE command on 0.11 to change 
the data type of a column from int to bigint, it gave me the following error:
   
   `pyspark.sql.utils.AnalysisException: ALTER TABLE CHANGE COLUMN is not 
supported for changing column 'id' with type 'IntegerType' to 'id' with type 
'LongType'`
   
   Do we have to rebuild the tables on the newer version directly?
   
   
   Following is my hoodie.properties file content 
   
   hoodie.table.timeline.timezone=LOCAL
   
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.table.precombine.field=when_updated
   hoodie.table.version=4
   hoodie.database.name=
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.table.checksum=2716619607
   hoodie.partition.metafile.use.base.format=false
   hoodie.archivelog.folder=archived
   hoodie.table.name=amz_hudi_vc_11_accounts
   hoodie.populate.meta.fields=true
   hoodie.table.type=COPY_ON_WRITE
   hoodie.datasource.write.partitionpath.urlencode=false
   hoodie.table.base.file.format=PARQUET
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.table.metadata.partitions=files
   hoodie.timeline.layout.version=1
   hoodie.table.recordkey.fields=id
   hoodie.table.partition.fields=
   
   
   Following is the complete error
   An error was encountered:
   ALTER TABLE CHANGE COLUMN is not supported for changing column 'id' with 
type 'IntegerType' to 'id' with type 'LongType'
   Traceback (most recent call last):
 File 
"/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/pyspark.zip/pyspark/sql/session.py",
 line 723, in sql
   return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
 File 
"/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/py4j-0.10.9.3-src.zip/py4j/java_gateway.py",
 line 1322, in __call__
   answer, self.gateway_client, self.target_id, self.name)
 File 
"/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/pyspark.zip/pyspark/sql/utils.py",
 line 117, in deco
   raise converted from None
   pyspark.sql.utils.AnalysisException: ALTER TABLE CHANGE COLUMN is not 
supported for changing column 'id' with type 'IntegerType' to 'id' with type 
'LongType'
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-06-07 Thread via GitHub


ad1happy2go commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2155139922

   @SuneethaYamani 
https://hudi.apache.org/docs/configurations/#hoodiemetadataenable
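   
   A minimal sketch of how that would look in the Deltastreamer arguments list shared 
in this thread (mitigation only; the rest of the arguments stay unchanged):
   ```
   "--hoodie-conf", "hoodie.metadata.enable=false",
   ```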


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-06-06 Thread via GitHub


KnightChess commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2153754087

   > @KnightChess do you have interest in pushing this feature forward?
   
   @danny0405 yes, I will follow up on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-06-06 Thread via GitHub


danny0405 commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2153650845

   @KnightChess do you have interest in pushing this feature forward?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-06-06 Thread via GitHub


cono commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2152342411

   This is a really useful feature to have.
   We want to use Hudi at work, but unfortunately we have a couple of 
bucketed/sorted tables, and this is definitely a blocker for us in migrating to 
Hudi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] [hudi]

2024-06-05 Thread via GitHub


danny0405 commented on issue #11403:
URL: https://github.com/apache/hudi/issues/11403#issuecomment-2151279640

   I would suggest you use 0.12.3 or 0.14.1; 0.12.1 still has some 
stability issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] [hudi]

2024-06-05 Thread via GitHub


zaminhassnain06 opened a new issue, #11403:
URL: https://github.com/apache/hudi/issues/11403

   Hi
   Our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1 and also 
updating the required spark and EMR versions. Our existing data sets (100s of 
TBs of data on S3) are written using Hudi 0.6.0.
   
   The latest version of Hudi has come a long way since 0.6.0, and we are not sure 
how to use 0.12.1 directly.
   
   Could someone provide the steps for upgrading from 0.6.0 to 0.12.1?
   
   Do we have to rebuild our tables? We are more concerned about this, as the tables 
have billions of records.
   
   Should we expect the following improvements after the upgrade:
 – faster upserts
   
– column add/modify (schema evolution)
   
– clustering
   
– a possible solution for storing the history of updates performed on records
   
   Thanks,
   Zamin Hassnain


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-06-03 Thread via GitHub


SuneethaYamani commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2144490005

   @ad1happy2go can you please share the config to disable this?
   Temporarily, I changed hoodie.metadata.compact.max.delta.commits=365 to avoid 
this blocker.
   
   I am using below config
   arguments = [
   "--table-type", table_type,
   "--op", op,
   "--enable-sync",
   "--source-ordering-field", source_ordering_field,
   "--source-class", "org.apache.hudi.utilities.sources.JsonDFSSource",
   "--target-table", table_name,
   "--target-base-path", hudi_target_path,
   "--payload-class", "org.apache.hudi.common.model.HoodieAvroPayload",
   "--transformer-class", 
"org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
   "--props", props,
   "--schemaprovider-class", 
"org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
   "--hoodie-conf", 
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator",
   "--hoodie-conf", 
"hoodie.datasource.write.recordkey.field={}".format(record_key),
   "--hoodie-conf", 
"hoodie.datasource.write.partitionpath.field={}".format(partition_field),
   "--hoodie-conf", 
"hoodie.streamer.source.dfs.root={}".format(delta_streamer_source),
   "--hoodie-conf", 
"hoodie.datasource.write.precombine.field={}".format(precombine),
   "--hoodie-conf", "hoodie.database.name={}".format(glue_db),
   "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",
   "--hoodie-conf", "hoodie.metadata.record.index.enable=true",
   "--hoodie-conf", "hoodie.datasource.insert.dup.policy=true",
   "--hoodie-conf", "hoodie.table.cdc.enabled=true",
   "--hoodie-conf", "hoodie.index.type=RECORD_INDEX", 
   "--hoodie-conf", 
"hoodie.datasource.hive_sync.table={}".format(table_name),
   "--hoodie-conf", 
"hoodie.datasource.hive_sync.partition_fields={}".format(partition_field),
   "--hoodie-conf", 
"hoodie.datasource.schema.avro.path={}".format(schema_path),
   "--hoodie-conf", "hoodie.datasource.schema.strategy=UNION",
   "--hoodie-conf", "hoodie.streamer.transformer.sql={}".format(sql),
   "--hoodie-conf", 
"hoodie.streamer.schemaprovider.source.schema.file={}".format(schema_path),
   "--hoodie-conf", 
"hoodie.streamer.schemaproider.target.schema.file={}".format(schema_path),
   "--hoodie-conf", "hoodie.metadata.compact.max.delta.commits=365"
   ]
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2142552863

   @SuneethaYamani The metadata table helps you reduce file-listing API calls. 
You can disable it in case it is the thing becoming the bottleneck.
   
   That said, we want to understand why it's taking so long. Can you share your writer 
configs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]

2024-05-30 Thread via GitHub


prabodh1194 commented on issue #9070:
URL: https://github.com/apache/hudi/issues/9070#issuecomment-2139020981

   But I'm still facing a bunch of issues with the Java classpath.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]

2024-05-29 Thread via GitHub


prabodh1194 commented on issue #9070:
URL: https://github.com/apache/hudi/issues/9070#issuecomment-2138022231

   Yeah, I just wanted to check out Kafka Connect and got massively stuck on this 
issue :( . Anyway, I think prefixing the props with 
   `consumer.override` works well.
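   
   For anyone landing here later, a minimal sketch of what that prefixing looks like in 
the connector properties (the connector class name is per the hudi-kafka-connect module 
and should be verified against your bundle; the broker and topic values are placeholders, 
and the worker's connector.client.config.override.policy must also permit overrides):
   ```
   # Sketch: overriding the sink connector's consumer settings via the consumer.override.* prefix.
   name=hudi-sink
   connector.class=org.apache.hudi.connect.HoodieSinkConnector
   topics=my-topic
   consumer.override.bootstrap.servers=broker-1:9092
   consumer.override.auto.offset.reset=earliest
   ```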


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]

2024-05-29 Thread via GitHub


soumilshah1995 commented on issue #9070:
URL: https://github.com/apache/hudi/issues/9070#issuecomment-2137973021

   Why not use Deltastreamer instead?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]

2024-05-29 Thread via GitHub


prabodh1194 commented on issue #9070:
URL: https://github.com/apache/hudi/issues/9070#issuecomment-2137909921

   I am facing the same issue and have searched around everywhere. What am I missing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-05-28 Thread via GitHub


SuneethaYamani commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2134603908

   @ad1happy2go  Yes, it is for the metadata table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-05-27 Thread via GitHub


ad1happy2go commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2134341804

   @SuneethaYamani That's not possible. Can you share the configs? One possibility is 
that the compaction you are seeing is not for your main table; it may be for the 
metadata table, which is MOR by design.
   Can you confirm whether it's the metadata table? You can also try disabling the 
metadata table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] [hudi]

2024-05-23 Thread via GitHub


Pavan792reddy opened a new issue, #11275:
URL: https://github.com/apache/hudi/issues/11275

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :0.14
   
   * Spark version :3.3.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :GCS
   
   * Running on Docker? (yes/no) : NO
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   
   spark-submit \
     --master 'local[*]' \
     --deploy-mode client \
     --packages 'org.apache.hudi:hudi-spark3.1-bundle_2.12:0.14.1,io.streamnative.connectors:pulsar-spark-connector_2.12:3.2.0.2' \
     --repositories https://repo.maven.apache.org/maven2 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --jars '/home/pavankumar_reddy/hudi-spark3.1-bundle_2.12-0.14.1.jar,/home/pavankumar_reddy/hudi-utilities_2.12-0.14.1.jar' \
     --class org.apache.hudi.utilities.streamer.HoodieStreamer ls /home/pavankumar_reddy/hudi-utilities-slim-bundle_2.12-0.14.1.jar \
     --source-class org.apache.hudi.utilities.sources.PulsarSource \
     --source-ordering-field when \
     --target-base-path gs://pulsarstreamer-test/hudi_data/avroschema_stream \
     --target-table avroschema_stre \
     --hoodie-conf hoodie.datasource.write.recordkey.field=id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=id \
     --table-type COPY_ON_WRITE \
     --op UPSERT \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
     --hoodie-conf hoodie.streamer.source.pulsar.topic=persistent://mytenant/mynamespace/avroschema \
     --hoodie-conf hoodie.streamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 \
     --hoodie-conf hoodie.streamer.source.pulsar.endpoint.admin.url=pulsar://localhost:8080 \
     --continuous
   
   
   
   Error:- 
   24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConnectionPool: [[id: 
0x6e32bfc9, L:/10.128.0.70:55298 - R:10.128.0.40/10.128.0.40:6650]] Connected 
to server
   24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ClientCnx: [id: 
0x6e32bfc9, L:/10.128.0.70:55298 - R:10.128.0.40/10.128.0.40:6650] Connected 
through proxy to target broker at localhost:6650
   24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConsumerImpl: 
[persistent://mytenant/mynamespace/avroschema][spark-pulsar-batch-97273cbf-ccc7-4e63-9c0c-60642c1ff1ed-persistent://mytenant/mynamespace/avroschema]
 Subscribing to topic on cnx [id: 0x6e32bfc9, L:/10.128.0.70:55298 - 
R:10.128.0.40/10.128.0.40:6650], consumerId 0
   24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConsumerImpl: 
[persistent://mytenant/mynamespace/avroschema][spark-pulsar-batch-97273cbf-ccc7-4e63-9c0c-60642c1ff1ed-persistent://mytenant/mynamespace/avroschema]
 Subscribed to topic on 10.128.0.40/10.128.0.40:6650 -- consumer: 0
   24/05/22 13:00:51 ERROR org.apache.hudi.utilities.streamer.HoodieStreamer: 
Shutting down delta-sync due to exception
   java.lang.UnsupportedOperationException: MessageId is null
   at 
org.apache.pulsar.client.impl.MessageIdImpl.compareTo(MessageIdImpl.java:214)
   at 
org.apache.pulsar.client.impl.MessageIdImpl.compareTo(MessageIdImpl.java:32)
   at 
org.apache.pulsar.client.impl.ConsumerImpl.hasMoreMessages(ConsumerImpl.java:2291)
   at 
org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailableAsync(ConsumerImpl.java:2237)
   at 
org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailable(ConsumerImpl.java:2181)
   at 
org.apache.spark.sql.pulsar.PulsarHelper.getUserProvidedMessageId(PulsarHelper.scala:451)
   at 
org.apache.spark.sql.pulsar.PulsarHelper.$anonfun$fetchCurrentOffsets$1(PulsarHelper.scala:415)
   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
   at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
   at 
org.apache.spark.sql.pulsar.PulsarHelper.fetchCurrentOffsets(PulsarHelper.scala:408)
   at 

[I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-05-23 Thread via GitHub


SuneethaYamani opened a new issue, #11273:
URL: https://github.com/apache/hudi/issues/11273

   Hi,
   
   I am creating a COW table. I want to run compaction separately instead of along with my write operation, so I used
   hoodie.datasource.write.streaming.disable.compaction=true.
   
   Compaction is still getting triggered.
   
   Usually the data write completes in about 2 min; whenever compaction is triggered, the jobs stay stuck in the running state.
   
   Thanks,
   Suneetha


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-22 Thread via GitHub


soumilshah1995 closed issue #11258: [SUPPORT] Hudi SQL Based Transformer Fails 
when trying to provide SQL File as input 
URL: https://github.com/apache/hudi/issues/11258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-22 Thread via GitHub


soumilshah1995 commented on issue #11258:
URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125096672

   Thanks man 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-22 Thread via GitHub


soumilshah1995 commented on issue #11258:
URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125075068

   really let me try 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-21 Thread via GitHub


ad1happy2go commented on issue #11258:
URL: https://github.com/apache/hudi/issues/11258#issuecomment-2123907590

   @soumilshah1995 Your transformer class should be --transformer-class 
org.apache.hudi.utilities.transform.SqlFileBasedTransformer.
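
   A minimal sketch (not from this thread) of how the flag and the property fit together; the path is a placeholder:
   ```
   # spark-submit flag
   --transformer-class org.apache.hudi.utilities.transform.SqlFileBasedTransformer
   
   # in the file passed via --props
   hoodie.streamer.transformer.sql.file=file:///path/to/join.sql
   ```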


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-19 Thread via GitHub


soumilshah1995 commented on issue #11258:
URL: https://github.com/apache/hudi/issues/11258#issuecomment-2119296530

   When providing a SQL file as input:
   ```
   java.lang.IllegalArgumentException: Property hoodie.streamer.transformer.sql 
not found
   
   ```
   
   Looks like it is still looking for hoodie.streamer.transformer.sql.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]

2024-05-19 Thread via GitHub


soumilshah1995 opened a new issue, #11258:
URL: https://github.com/apache/hudi/issues/11258

   
   Here is Delta Streamer 
   ```
   spark-submit \
 --class org.apache.hudi.utilities.streamer.HoodieStreamer \
 --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
 --properties-file spark-config.properties \
 --master 'local[*]' \
 --executor-memory 1g \
  
/Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar
 \
 --table-type COPY_ON_WRITE \
 --op UPSERT \
 --transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
 --source-ordering-field replicadmstimestamp \
 --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
 --target-base-path 
file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/
 \
 --target-table invoice \
 --props hudi_tbl.props
   ```
   # Hudi prop
   ```
   
hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy=READ_UPTO_LATEST_COMMIT
   
hoodie.streamer.source.hoodieincr.path=s3a://warehouse/default/table_name=orders
   hoodie.datasource.write.recordkey.field=order_id
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.precombine.field=ts
   
   
   ```
   
   Tried following options 
   ```
   
   hoodie.streamer.transformer.sql.file=join.sql
   OR
   
hoodie.streamer.transformer.sql.file=file:///Users/soumilshah/IdeaProjects/SparkProject/deltastreamerBroadcastJoins/join.sql
   OR
   
hoodie.streamer.transformer.sql.file=/Users/soumilshah/IdeaProjects/SparkProject/deltastreamerBroadcastJoins/join.sql
   ```
   
   # Error Message 
   
   ```
   
   FO BaseHoodieTableFileIndex: Refresh table orders, spent: 15 ms
   24/05/19 12:38:37 ERROR HoodieStreamer: Shutting down delta-sync due to 
exception
   java.lang.IllegalArgumentException: Property hoodie.streamer.transformer.sql 
not found
at 
org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys(ConfigUtils.java:334)
at 
org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys(ConfigUtils.java:308)
at 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:52)
at 
org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105)
at 
org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530)
at org.apache.hudi.common.util.Option.map(Option.java:108)
at 
org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530)
at 
org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495)
at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405)
at 
org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
   24/05/19 12:38:37 INFO HoodieStreamer: Delta Sync shutdown. Error ?true
   24/05/19 12:38:37 INFO HoodieStreamer: Ingestion completed. Has error: true
   24/05/19 12:38:37 INFO StreamSync: Shutting down embedded timeline server
   24/05/19 12:38:37 ERROR HoodieAsyncService: Service shutdown with error
   java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieException: Property 
hoodie.streamer.transformer.sql not found
at 
java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
at 
org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
at 
org.apache.hudi.utilities.ingestion.HoodieIngestionService.startIngestion(HoodieIngestionService.java:65)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
at 
org.apache.hudi.utilities.streamer.HoodieStreamer.sync(HoodieStreamer.java:205)
at 
org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:584)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 

[I] [SUPPORT] Hudi COW Encryptions [hudi]

2024-05-19 Thread via GitHub


soumilshah1995 opened a new issue, #11257:
URL: https://github.com/apache/hudi/issues/11257

   # Sample Code 
   
   ```
   try:
   import os
   import sys
   import uuid
   import pyspark
   import datetime
   from pyspark.sql import SparkSession
   from pyspark import SparkConf, SparkContext
   from faker import Faker
   import datetime
   from datetime import datetime
   import random 
   import pandas as pd  # Import Pandas library for pretty printing
   
   print("Imports loaded ")
   
   except Exception as e:
   print("error", e)
   
   HUDI_VERSION = '1.0.0-beta1'
   SPARK_VERSION = '3.4'
   
   os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"
   SUBMIT_ARGS = f"--packages 
org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} 
pyspark-shell"
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   
   # Spark session
   spark = SparkSession.builder \
   .config('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer') \
   .config('spark.sql.extensions', 
'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
   .config('className', 'org.apache.hudi') \
   .config('spark.sql.hive.convertMetastoreParquet', 'false') \
   .getOrCreate()
   
   spark._jsc.hadoopConfiguration().set("parquet.crypto.factory.class", 
"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
   spark._jsc.hadoopConfiguration().set("parquet.encryption.kms.client.class" , 
"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
   spark._jsc.hadoopConfiguration().set("parquet.encryption.footer.key", "k1")
   spark._jsc.hadoopConfiguration().set("parquet.encryption.column.keys", 
"k2:customer_id")
   
   
   global faker
   faker = Faker()
   
   
   def get_customer_data(total_customers=2):
   customers_array = []
   for i in range(0, total_customers):
   customer_data = {
   "customer_id": str(uuid.uuid4()),
   "name": faker.name(),
   "state": faker.state(),
   "city": faker.city(),
   "email": faker.email(),
   "created_at": datetime.now().isoformat().__str__(),
   "adqdress": faker.address(),
  "salary": faker.random_int(min=3, max=10) 
   }
   customers_array.append(customer_data)
   return customers_array
   
   global total_customers, order_data_sample_size
   total_customers = 1
   customer_data = get_customer_data(total_customers=total_customers)
   
   spark_df_customers = spark.createDataFrame(data=[tuple(i.values()) for i in 
customer_data],
  
schema=list(customer_data[0].keys()))
   spark_df_customers.show(1, truncate=False)
   spark_df_customers.printSchema()
   
   
   
   def write_to_hudi(spark_df, 
 table_name, 
 db_name, 
 method='upsert',
 table_type='COPY_ON_WRITE',
 recordkey='',
 precombine='',
 partition_fields='',
 index_type='BLOOM'
):
   
   path = 
f"file:///Users/soumilshah/IdeaProjects/SparkProject/tem/database={db_name}/table_name{table_name}"
   
   hudi_options = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.table.type': table_type,
   'hoodie.datasource.write.table.name': table_name,
   'hoodie.datasource.write.operation': method,
   'hoodie.datasource.write.recordkey.field': recordkey,
   'hoodie.datasource.write.precombine.field': precombine,
   "hoodie.datasource.write.partitionpath.field": partition_fields,
"hoodie.index.type": index_type,
   }
   
   if index_type == 'RECORD_INDEX':
   hudi_options.update({
   "hoodie.enable.data.skipping": "true",
   "hoodie.metadata.enable": "true",
   "hoodie.metadata.index.column.stats.enable": "true",
   "hoodie.write.concurrency.mode": 
"optimistic_concurrency_control",
   "hoodie.write.lock.provider": 
"org.apache.hudi.client.transaction.lock.InProcessLockProvider",
   "hoodie.metadata.record.index.enable": "true"
   })
   
   
   print("\n")
   print(path)
   print("\n")
   
   spark_df.write.format("hudi"). \
   options(**hudi_options). \
   mode("append"). \
   save(path)
   
   
   write_to_hudi(
   spark_df=spark_df_customers,
   db_name="default",
   table_name="customers",
   recordkey="customer_id",
   precombine="created_at",
   partition_fields="state",
   index_type="BLOOM"
   )
   ```
   
   # Error 
   ```
   24/05/19 11:01:48 ERROR SimpleExecutor: Failed consuming records
   org.apache.parquet.crypto.ParquetCryptoRuntimeException: 

Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-17 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2117238692

   Thanks @matthijseikelenboom for the update


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-17 Thread via GitHub


matthijseikelenboom closed issue #11170: [SUPPORT] Hudi fails ACID verification 
test
URL: https://github.com/apache/hudi/issues/11170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-17 Thread via GitHub


matthijseikelenboom commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2117016086

   Tested and verified. Closing issues.
   
    More info
   Solution has been tested on:
   - Java 8 ✅
   - Java 11 ✅
   - Java 17 ❌ (As of this moment, Hudi doesn't support this version)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-16 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115400503

   @matthijseikelenboom I was able to test successfully. There were two issues:
   1. InProcessLockProvider doesn't work for multiple writers, so use FileSystemBasedLockProvider in TransactionWriter.java:
   ```
   dataSet.write().format("hudi")
   .option("hoodie.table.name", tableName)
   .option("hoodie.datasource.write.recordkey.field", 
"primaryKeyValue")
   .option("hoodie.datasource.write.partitionpath.field", 
"partitionKeyValue")
   .option("hoodie.datasource.write.precombine.field", 
"dataValue")
   .option("hoodie.write.lock.provider", 
"org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
   .mode(SaveMode.Append)
   .save(tablePath);
   ```
   2. Along with the refresh, also run a repair so newly added partitions are picked up in ReaderThread:
   ```
   session.sql("REFRESH TABLE " + fullyQualifiedTableName);
   session.sql("MSCK REPAIR TABLE " + fullyQualifiedTableName);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-16 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115401719

   @matthijseikelenboom Please let us know if it works for you also. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-16 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115007277

   @matthijseikelenboom I tried to run it locally but am again seeing issues. We can connect once. If you are on the Apache Hudi Slack, can you ping me ("Aditya Goenka")?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi 0.14.0 - deletion from table failing for org.apache.hudi.keygen.TimestampBasedKeyGenerator [hudi]

2024-05-15 Thread via GitHub


Priyanka128 commented on issue #10823:
URL: https://github.com/apache/hudi/issues/10823#issuecomment-2111792611

   > I think your timestamp.type should be "DATE_STRING".
   
   Tried this but getting below exception:
   _Caused by: java.lang.RuntimeException: 
hoodie.keygen.timebased.timestamp.scalar.time.unit is not specified but scalar 
it supplied as time value
 at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.convertLongTimeToMillis(TimestampBasedAvroKeyGenerator.java:216)
 at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:187)
 at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:118)
 ... 18 more_
   
   After encountering this exception, I removed "hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit" -> "DAYS", but the same exception "Caused by: java.lang.RuntimeException: hoodie.keygen.timebased.timestamp.scalar.time.unit is not specified but scalar it supplied as time value" still came up.
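
   For reference, a hypothetical config sketch for TimestampBasedKeyGenerator with a string-formatted date field (key names per the key generator docs; the date formats below are placeholders, so verify them against your actual partition values and Hudi version):
   ```
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.keygen.timebased.input.dateformat=yyyy-MM-dd
   hoodie.keygen.timebased.output.dateformat=yyyy-MM-dd
   
   # If the field actually holds a numeric epoch/scalar value, a time unit is required instead:
   # hoodie.keygen.timebased.timestamp.type=SCALAR
   # hoodie.keygen.timebased.timestamp.scalar.time.unit=DAYS
   ```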


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


ziudu commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109284886

   I'm a newbie. It took me a while to understand why bucket join does not work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109221161

   > So if we have to choose one between spark and hive, I think spark might be 
of higher priority
   
   I agree, do you have the bandwidth to complete that suspended PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


ziudu commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109160408

   Hi Danny0405,
   
   I think supporting a Spark sort-merge join between two Hudi tables with bucket optimization is an important feature.
   
   Currently, if we join two Hudi tables, the bucket index's bucket information is not usable by Spark, so a shuffle is always needed. As explained in [8657](https://github.com/apache/hudi/pull/8657), the hashing, file naming, file numbering and file sorting are different.
   
   Unfortunately, according to https://issues.apache.org/jira/browse/SPARK-19256, Spark bucketing is not compatible with Hive bucketing yet. So if we have to choose between Spark and Hive, I think Spark should be the higher priority.
  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


ziudu opened a new issue, #11204:
URL: https://github.com/apache/hudi/issues/11204

   According to parisni in [HUDI-6150] "Support bucketing for each hive client" (https://github.com/apache/hudi/pull/8657):
   
   "So I assume hudi way of doing (which is not compliant with both hive and 
spark) cannot be used to improve query engines queries such join and filter. 
Then this leads all of below are wrong:
   
   the current config 
https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
   this current PR
   the rfc statement about support of hive bucketing 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index;
   
   Do you have any update on this?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]

2024-05-10 Thread via GitHub


boneanxs commented on issue #11188:
URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105500024

   > > I actually see hudi could set many spark relate configures in SparkConf, 
most of them are related to parquet reader/writer.
   > 
   > Are these options configurable?
   
   Yes, these configs can be set by users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]

2024-05-10 Thread via GitHub


danny0405 commented on issue #11188:
URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105384268

   > I actually see hudi could set many spark relate configures in SparkConf, 
most of them are related to parquet reader/writer.
   
   Are these options configurable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi could override users' configurations [hudi]

2024-05-10 Thread via GitHub


boneanxs opened a new issue, #11188:
URL: https://github.com/apache/hudi/issues/11188

   We recently also hit the issue https://github.com/apache/hudi/issues/9305, but with a different cause (we still use hudi 0.12).
   
   The user manually set `spark.sql.parquet.enableVectorizedReader` to false, then read a Hive table and cached it. Since Spark analyzes the plan first when it needs to be cached, Spark won't add `C2R` to that cached plan because the vectorized reader is false. At this point Spark won't execute that plan, since there is no action operator.
   
   Then the user reads a MOR read_optimized table, joins it with that cached plan and collects the result. Because the MOR table automatically sets `enableVectorizedReader` back to true, the Hive table is actually read as columnar batches, but the plan doesn't contain `C2R` to convert the batches to rows, so the error occurs:
   
   ![Screenshot 2024-05-10 at 18 32 
22](https://github.com/apache/hudi/assets/10115332/14b387e0-ecee-4c04-9aff-ba024ce3af55)
   
   ```java
   java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at 
org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
at 
   ```
   ```scala
 override def imbueConfigs(sqlContext: SQLContext): Unit = {
   super.imbueConfigs(sqlContext)
   
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader",
 "true")
 }
   ```
   
   I see there's some modification in the master code, but I suspect this issue 
could still happen since we'd also modify it in 
`HoodieFileGroupReaderBasedParquetFileFormat`:
   
   ```scala
   spark.conf.set("spark.sql.parquet.enableVectorizedReader", 
supportBatchResult)
   ```
   
   Besides this issue, is it appropriate to set Spark configs globally? Whether users set them or not, I see that Hudi can set many Spark-related configs in `SparkConf`, most of them related to the parquet reader/writer. This can confuse users and make it hard for devs to find the cause.
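
   For illustration, a hypothetical PySpark reproduction sketch of the scenario described above (the table name, path and join key are placeholders, and the Hudi bundle is assumed to be on the classpath):
   ```python
   from pyspark.sql import SparkSession
   
   spark = (SparkSession.builder
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
            .getOrCreate())
   
   # 1. Disable vectorized reads, read a Hive parquet table and cache it.
   #    The cached plan is analyzed now (no C2R node added), but not executed yet.
   spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
   hive_df = spark.table("some_hive_parquet_table").cache()
   
   # 2. Per the description above, reading a MOR read_optimized table sets
   #    enableVectorizedReader back to true behind the user's back.
   hudi_ro_df = (spark.read.format("hudi")
                 .option("hoodie.datasource.query.type", "read_optimized")
                 .load("/path/to/mor_table"))
   
   # 3. Executing the cached plan now yields ColumnarBatch where InternalRow is
   #    expected, producing the ClassCastException shown in the stacktrace above.
   hive_df.join(hudi_ro_df, "id").count()
   ```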


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-09 Thread via GitHub


matthijseikelenboom commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102747936

   @ad1happy2go I've pushed a new branch on the repo where the project is downgraded to Java 8. When running the test now, the writers don't seem to fail anymore, but it still fails the verification test.
   
   https://github.com/apache/hudi/assets/1364843/30384d79-6905-4c2e-96f0-e246d5589469
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Record Index not working as Expected: gives warning as "WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records" [hudi]

2024-05-09 Thread via GitHub


zeeshan-media closed issue #10507: [SUPPORT] Hudi Record Index not working as 
Expected: gives warning as "WARN SparkMetadataTableRecordIndex: Record index 
not initialized so falling back to GLOBAL_SIMPLE for tagging records"
URL: https://github.com/apache/hudi/issues/10507


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-09 Thread via GitHub


matthijseikelenboom commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102603694

   Okay, yeah sure. The original test was written with Java 11, but I updated to 17 because I figured why not, since Spark 3.4.2 supports it.
   
   Is it known that Hudi (Or Kryo) also doesn't work with Java 11 and is that 
why you suggest Java 8? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-09 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102355732

   @matthijseikelenboom I noticed you are using Java 17 for this. Hudi 0.14.1 doesn't support Java 17 yet; a newer Hudi version will support it.
   
   A reference to a similar Java 17 related issue is here - https://github.com/EsotericSoftware/kryo/issues/885
   
   Can you try with Java 8 once? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-08 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2101987758

   @matthijseikelenboom Looks like there are some library conflicts in the project. Need to reproduce it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-08 Thread via GitHub


matthijseikelenboom commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2100205976

   @ad1happy2go Ah yes, you're right. I seem to have forgotten to add the hudi-defaults.conf file to this project. I've added it to my repository and ran the test again. It gets further along, but still breaks down.
   
   Stacktrace (Be warned, it's a big one):
   ```
   ERROR! : Failed to upsert for commit time 20240508114518478
   24/05/08 11:45:18 ERROR TransactionWriter: Exception in writer.
   java.lang.RuntimeException: org.apache.hudi.exception.HoodieUpsertException: 
Failed to upsert for commit time 20240508114518478
at 
org.example.writer.TransactionWriter.wrapOrRethrowException(TransactionWriter.java:192)
at 
org.example.writer.TransactionWriter.tryTransaction(TransactionWriter.java:184)
at 
org.example.writer.TransactionWriter.updateTransaction(TransactionWriter.java:143)
at 
org.example.writer.TransactionWriter.lambda$handleTransaction$0(TransactionWriter.java:89)
at 
org.example.writer.TransactionWriter.withRetryOnException(TransactionWriter.java:109)
at 
org.example.writer.TransactionWriter.handleTransaction(TransactionWriter.java:83)
at org.example.writer.TransactionWriter.run(TransactionWriter.java:70)
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert 
for commit time 20240508114518478
at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:70)
at 
org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:44)
at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:114)
at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:142)
at 
org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:224)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:504)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:502)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:439)
at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:282)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
at 

Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-08 Thread via GitHub


ad1happy2go commented on issue #11170:
URL: https://github.com/apache/hudi/issues/11170#issuecomment-2100011078

   @matthijseikelenboom I don't see any lock-related configurations in your setup. I checked that you are using 2 parallel writers, so you need to configure a lock provider for writes. Hudi follows the OCC principle.
   Check multi writer setup here - 
https://hudi.apache.org/docs/concurrency_control/#model-c-multi-writer
   
   Let me know in case I am missing anything on the same. Thanks a lot.
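
   For reference, a minimal sketch of the lock-related options that page describes (values are illustrative; verify the exact keys and choose a lock provider that fits your storage):
   ```
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider
   ```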


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi fails ACID verification test [hudi]

2024-05-07 Thread via GitHub


matthijseikelenboom opened a new issue, #11170:
URL: https://github.com/apache/hudi/issues/11170

   **Describe the problem you faced**
   
   For work we needed concurrent read/write support for our data lake, which uses Spark. We were noticing some inconsistencies, so we wrote a test that can verify whether something like Hudi adheres to ACID. We did, however, find that Hudi fails this test.
   
   Now, it could be that we've configured Hudi incorrectly or that there is some mistake in the test code.
   
   My question is whether someone could take a look at it and perhaps explain what is going wrong here.
   
   **To Reproduce**
   
   How to run the test and its findings are described in the README of the repository, but here is a short rundown.
   
   Steps to reproduce the behavior:
   
   1. Check out repo: 
[hudi-acid-verification](https://github.com/matthijseikelenboom/hudi-acid-verification)
   2. Start Docker if not already running
   3. Run the test 
[TransactionManagerTest.java](https://github.com/matthijseikelenboom/hudi-acid-verification/blob/master/src/test/java/org/example/writer/TransactionManagerTest.java)
   4. Observe that the writers break down and that very few transactions have been processed.
   
   **Expected behavior**
   
   1. I expect the writers not to break down
   2. I expect that the full number of transactions is executed
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.4.2
   
   * Hive version : 4.0.0-beta-1
   
   * Hadoop version : 3.2.2
   
   * Storage (HDFS/S3/GCS..) : NTFS(Windows), APFS(macOS) & HDFS
   
   * Running on Docker? (yes/no) : No
   
   **Additional context**
   It's worth noting that other solutions, Iceberg and Delta Lake, have also 
been tested this way. Iceberg also didn't pass this test. Delta Lake did pass 
the test.
   
   **Stacktrace**
   
   ```
   24/05/07 21:49:38 ERROR TransactionWriter: Exception in writer.
   org.example.writer.TransactionFailedException: 
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
file:/tmp/lakehouse/concurrencytestdb.db/acid_verification commits 
20240507214932607
at 
org.example.writer.TransactionWriter.wrapOrRethrowException(TransactionWriter.java:190)
at 
org.example.writer.TransactionWriter.tryTransaction(TransactionWriter.java:184)
at 
org.example.writer.TransactionWriter.updateTransaction(TransactionWriter.java:143)
at 
org.example.writer.TransactionWriter.lambda$handleTransaction$0(TransactionWriter.java:89)
at 
org.example.writer.TransactionWriter.withRetryOnException(TransactionWriter.java:109)
at 
org.example.writer.TransactionWriter.handleTransaction(TransactionWriter.java:83)
at org.example.writer.TransactionWriter.run(TransactionWriter.java:70)
   Caused by: org.apache.hudi.exception.HoodieRollbackException: Failed to 
rollback file:/tmp/lakehouse/concurrencytestdb.db/acid_verification commits 
20240507214932607
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1065)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1012)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:940)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:922)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:917)
at 
org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:941)
at 
org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:940)
at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:933)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:501)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:439)
at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:282)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 

Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-05-01 Thread via GitHub


sgcisco commented on issue #8:
URL: https://github.com/apache/hudi/issues/8#issuecomment-2089073931

   @ad1happy2go the record key looks like `record_keys=["timestamp", "A", "B", "C"]`, where `timestamp` is monotonically increasing in ms, `A` is a string with a range of some 500k values, `B` is similar to `A`, and `C` has at most a few hundred values.
   We use `upsert`, which is the default operation, but we don't expect any updates to the inserted values.
   We tried `insert` but the observed latencies were worse.
   
   Increasing the partitioning granularity from daily to hourly seems to help decrease latencies, but does not solve the problem completely.
   ![Screenshot 2024-05-01 at 22 07 
16](https://github.com/apache/hudi/assets/168409126/7cd6bd72-2ecb-4826-99f6-567481b234bc)
   
   In this case the partition size goes down from 100GB to 4.7GB.
   
   > Are you seeing the disk spill during this operation, you can try 
increasing the executor memory to avoid the same.
   
   No, not even over a 15h running job:
   
   ![Screenshot 2024-05-01 at 22 19 
07](https://github.com/apache/hudi/assets/168409126/c860312d-ed03-427d-aaa8-ca9bedcb0ed5)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-05-01 Thread via GitHub


ad1happy2go commented on issue #8:
URL: https://github.com/apache/hudi/issues/8#issuecomment-2088378573

   @sgcisco What is the nature of your record key? Is it a random id? Building the workload profile does the index lookup, which is basically a join between the existing data and the incremental data to identify which records should be updated or inserted.
   Are you seeing disk spill during this operation? You can try increasing the executor memory to avoid it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-04-30 Thread via GitHub


sgcisco commented on issue #8:
URL: https://github.com/apache/hudi/issues/8#issuecomment-2086534021

   @ad1happy2go thanks for your reply. We tried setting the compaction delta commits to 1 in one of the earlier test runs; in what we are trying to use now it is the default value, which is 5.
   
   As another test we ran the pipeline over several days, but with a lower ingestion rate of 600KB/s and the same Hudi and Spark configuration as above.
   
   The most time-consuming stage is `Building workload profile`, which takes 2.5-12 min, with an average of around 7 min.
   
   ![Screenshot 2024-04-30 at 19 44 
00](https://github.com/apache/hudi/assets/168409126/ceb6353a-b90f-4abd-8111-5477338701d5)
   
   ![Screenshot 2024-04-30 at 20 37 
15](https://github.com/apache/hudi/assets/168409126/03b7fe99-7eba-4a24-b4b6-446a6b527c67)
   
   So in this case it is around 35-40MB per minute (the current Structured Streaming minibatch), and workers can go up to 35GB and 32 cores.
   Does that look like a sufficient resource config for Hudi to handle such a load?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #8:
URL: https://github.com/apache/hudi/issues/8#issuecomment-2085803990

   Thanks for raising this @sgcisco. I noticed you are using a compaction delta commits setting of 1. Any reason for that? If we need to compact after every commit, then it is better to use a COW table itself.
   One other reason may be that the ingestion job is starved of resources, since the async compaction job may be consuming them. Did you analyse the Spark UI? Which stage has started taking more time?
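
   For reference, an illustrative sketch (not from this thread) of MOR settings that keep compaction out of the ingestion commit path while still triggering it every few delta commits; verify the exact keys for your Hudi version:
   ```
   hoodie.datasource.write.table.type=MERGE_ON_READ
   
   # keep compaction out of the ingestion commit path
   hoodie.compact.inline=false
   
   # run compaction asynchronously for streaming writes
   hoodie.datasource.compaction.async.enable=true
   
   # compact every N delta commits rather than after every commit
   hoodie.compact.inline.max.delta.commits=5
   ```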


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-04-29 Thread via GitHub


sgcisco opened a new issue, #8:
URL: https://github.com/apache/hudi/issues/8

   **Describe the problem you faced**
   
   Running a streaming solution with Kafka - Structured Streaming (PySpark) - Hudi (MOR tables) + AWS Glue + S3, we observed periodically growing latencies in data availability in Hudi.
   Latencies were measured as the difference between the data generation `timestamp` and `_hudi_commit_timestamp`, and could go up to 30 min. Periodic manual checks of the latest available data points' `timestamps`, by running queries as described in https://hudi.apache.org/docs/0.13.1/querying_data#spark-snap-query, confirmed such delays.
   
   
![image](https://github.com/apache/hudi/assets/168409126/5f7e6e1c-565b-47c1-b293-898cf2d8c40b)
   
   
![image](https://github.com/apache/hudi/assets/168409126/9bb379fa-85bc-467d-853f-8dc9651803b3)
   
   When using Spark with Hudi, the data read-out rate from Kafka was unstable:
   
   ![Screenshot 2024-04-29 at 11 49 
29](https://github.com/apache/hudi/assets/168409126/1f114523-a574-4d39-90e8-a6d674f79aa0)
   
   To exclude impact from any component other than Hudi, we ran some experiments with the same configuration and ingestion settings but without Hudi, writing directly to S3. This did not reveal any delays above 2 min (a 1 min delay is always present due to the Structured Streaming minibatch granularity). In this case the Kafka read-out rate was stable over time.
   
   **Additional context**
   
   We tried to optimize Hudi file sizing and MOR layout by applying suggestions 
from these references 
https://github.com/apache/hudi/issues/2151#issuecomment-706400445,
   
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles,
   https://github.com/apache/hudi/issues/2151#issuecomment-706400445
   
   We could get a target file size between 90-120MB by lowering `hoodie.copyonwrite.record.size.estimate` from 1024 to 100 and using `Inline.compact=false and delta.commits=1 and async.compact=true and hoodie.merge.small.file.group.candidates.limit=20`, but it did not have any impact on latency.
   
   Another compaction trigger strategy, `NUM_OR_TIME`, as suggested in https://github.com/apache/hudi/issues/8975#issuecomment-1593408753, with the parameters below, did not help resolve the problem:
   ```
   "hoodie.copyonwrite.record.size.estimate": "100",
   "hoodie.compact.inline.trigger.strategy": "NUM_OR_TIME",
   "hoodie.metadata.compact.max.delta.commits": "5",
   "hoodie.compact.inline.max.delta.seconds": "60",
   ``` 
   
   As a trade-off we came up with the configuration below, which allows us to have relatively low latencies at the 90th percentile and file sizes of 40-90MB:
   ```
   "hoodie.merge.small.file.group.candidates.limit": "40",
   "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
   ```
   
   
![10_31_12](https://github.com/apache/hudi/assets/168409126/bf85386b-7f6e-48a9-b855-ff8cb391080d)
   
   
   But still some records could go up to 30 min.
   
   
![02_42_29](https://github.com/apache/hudi/assets/168409126/9e6442c1-8cfa-4778-abc5-5d1050cb3653)
   
   However the last config works relatively well for low ingestion rates up to 1.5MB/s with daily partitioning `partition_date=-MM-dd/`, but stops working for rates above 2.5MB/s even with more granular partitioning `partition_date=-MM-dd-HH/`.
   
   **Expected behavior**
   
   Since we use MOR tables:
   - low latencies on data availability 
   - proper file sizing defined by the limits 
   ```
   "hoodie.parquet.small.file.limit" : "104857600",
   "hoodie.parquet.max.file.size" : "125829120",
   
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.4.1
   
   * Hive version : 3.1
   
   * Hadoop version : EMR 6.13
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   Hudi configuration
   ```
   "hoodie.datasource.hive_sync.auto_create_database": "true",
   "hoodie.datasource.hive_sync.enable": "true",
   "hoodie.datasource.hive_sync.mode": "hms",
   "hoodie.datasource.hive_sync.table": table_name,
   "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.datasource.hive_sync.use_jdbc": "false",
   "hoodie.datasource.hive_sync.database": _glue_db_name,
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.operation": "upsert",
   
"hoodie.datasource.write.schema.allow.auto.evolution.column.drop": "true",
   "hoodie.datasource.write.table.name": table_name,
   "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   

Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-04-29 Thread via GitHub


codope commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081997720

   Yes this was fixed in 0.13.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-04-29 Thread via GitHub


codope closed issue #8114: [SUPPORT] Hudi partitions not dropped by Hive sync 
after `insert_overwrite_table` operation
URL: https://github.com/apache/hudi/issues/8114


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-04-28 Thread via GitHub


danny0405 commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081940660

   cc @codope guess this should have been fixed? 
https://github.com/apache/hudi/pull/6662


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-04-28 Thread via GitHub


zhaobangcai commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081816664

   Has this problem been solved? @Limess 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]

2024-04-28 Thread via GitHub


zhaobangcai commented on issue #8114:
URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081815485

   Is there any further update on this question? Do you have plans to fix it, and if so, in which version? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] [hudi]

2024-04-15 Thread via GitHub


jack1234smith opened a new issue, #11023:
URL: https://github.com/apache/hudi/issues/11023

   **Describe the problem you faced**
   
   Error exception:
   java.util.NoSuchElementException: No value present in Option
       at org.apache.hudi.common.util.Option.get(Option.java:89)
       at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.initIterator(MergeOnReadInputFormat.java:204)
       at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:189)
       at org.apache.hudi.source.StreamReadOperator.processSplits(StreamReadOperator.java:169)
       at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
       at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
       at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
       at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
       at java.lang.Thread.run(Thread.java:745)
   
   data error:
   
![17131698404766](https://github.com/apache/hudi/assets/50668893/9fb26f15-6228-486d-a9c5-2b70c746f784)
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. kill yarn session
   2. Restart job from checkpoint
   
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Yarn Session? (yes/no) :  yes
   
   
   **Additional context**
   
   My tables are:
   CREATE TABLE if not exists ods_table(
id int,
count_num double,
write_time timestamp(0),
_part string,
proc_time  timestamp(3),
WATERMARK FOR write_time AS write_time
   ) 
   PARTITIONED BY (_part)
   WITH (
   'connector'='hudi',
   'path' ='hdfs://masters/test/ods_table',
   'table.type'='MERGE_ON_READ',
   'hoodie.datasource.write.recordkey.field' = 'id',
   'hoodie.datasource.write.precombine.field' = 'write_time', 
   'write.bucket_assign.tasks'='1',
   'write.tasks' = '1', 
   'compaction.tasks' = '1',
   'compaction.async.enabled' = 'true',
   'compaction.schedule.enabled' = 'true',
   'compaction.trigger.strategy' = 'time_elapsed',
   'compaction.delta_seconds' = '600',
   'compaction.delta_commits' = '1',
   'read.streaming.enabled' = 'true',
   'read.streaming.skip_compaction' = 'true',
   'read.start-commit' = 'earliest',
   'changelog.enabled' = 'true',
   'hive_sync.enable'='true',
   'hive_sync.mode' = 'hms',
   'hive_sync.metastore.uris' = 'thrift://h35:9083',
   'hive_sync.db'='test',
   'hive_sync.table'='hive_ods_table'
   );
   
   CREATE TABLE if not exists ads_table(
sta_date string,
num double,
proc_time as proctime()
   )
   WITH (
   'connector'='hudi',
   'path' ='hdfs://masters/test/ads_table',
   'table.type'='COPY_ON_WRITE',
   'hoodie.datasource.write.recordkey.field' = 'sta_date',
   'write.bucket_assign.tasks'='1',
   'write.tasks' = '1', 
   'compaction.tasks' = '1',
   'compaction.async.enabled' = 'true',
   'compaction.schedule.enabled' = 'true',
   'compaction.trigger.strategy' = 'time_elapsed',
   'compaction.delta_seconds' = '600',
   'compaction.delta_commits' = '1',
   'read.streaming.enabled' = 'true',
   'read.streaming.skip_compaction' = 'true',
   'read.start-commit' = 'earliest',
   'changelog.enabled' = 'true',
   'hive_sync.enable'='true',
   'hive_sync.mode' = 'hms',
   'hive_sync.metastore.uris' = 'thrift://h35:9083',
   'hive_sync.db'='test',
   'hive_sync.table'='hive_ads_table'
   );
   
   My job is:
   insert into test.ads_table
   select _part, sum(count_num)
   from test.ods_table
   group by _part;
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[I] [SUPPORT] [hudi]

2024-04-11 Thread via GitHub


VitoMakarevich opened a new issue, #10997:
URL: https://github.com/apache/hudi/issues/10997

   **Describe the problem you faced**
   
   We are using Spark 3.3 and Hudi 0.12.2.
   I need your assistance in improving the `Doing partition and writing data` stage. For us, it looks to be the most time consuming. We are using `snappy` compression (the most lightweight of those available, as far as I know); file size is ~160 MB, which is effectively 80-90 GB GZIP (the default codec in Hudi for our workload). The files themselves consist of 1.5-2M rows.
   Our problem is that, due to the partitioning plus the CDC nature of the workload, we must update a lot of files at peak hours. We have clustering to group rows together, but thousands of files are still affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and it does not correlate with the number of rows updated (for the 75th percentile it's < 100 rows changed in each file). Also, the payload class is almost the default (minor changes which do not affect performance, IMO).
   Q:
   1. What knobs can we play with?
   We tried the compression format (`snappy` looks to be the best among `zstd`, which has a memory leak in Spark 3.3 BTW, and `gzip`).
   We also tried `hoodie.write.buffer.limit.bytes`, raising it to 32 MB, unfortunately with no visible difference.
   Is there any other?
   2. Do you know of any performance improvements in newer versions (0.12.3-0.14.1) specifically regarding the file write (`MergeHandle`) task?
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Spark version : 3.3.0
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi CLI bundle not working [hudi]

2024-04-08 Thread via GitHub


mansipp commented on issue #10566:
URL: https://github.com/apache/hudi/issues/10566#issuecomment-2043833097

   Getting a similar error while running `commit rollback`:
   ```
   commit rollback --commit 20240408231846380
   24/04/08 23:22:02 INFO InputStreamConsumer: Apr 08, 2024 11:22:02 PM 
org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
   24/04/08 23:22:02 INFO InputStreamConsumer: WARNING: 
spark.log4jHotPatch.enabled is set to true, but 
/usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not 
exist at the configured location
   24/04/08 23:22:02 INFO InputStreamConsumer:
   24/04/08 23:22:03 INFO InputStreamConsumer: Error: Failed to load 
org.apache.hudi.cli.commands.SparkMain: 
org/apache/hudi/common/engine/HoodieEngineContext
   24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO 
ShutdownHookManager: Shutdown hook called
   24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO 
ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-272bb6ef-f858-42a6-b9d0-9614f1f36371
   24/04/08 23:22:03 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/
   24/04/08 23:22:03 INFO HoodieTableConfig: Loading table properties from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties
   24/04/08 23:22:03 INFO S3NativeFileSystem: Opening 
's3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties'
 for reading
   24/04/08 23:22:03 INFO HoodieTableMetaClient: Finished Loading Table of type 
MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from 
s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/
   Commit 20240408231846380 failed to roll back
   ```
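   
   For what it's worth, the `Failed to load org.apache.hudi.cli.commands.SparkMain: org/apache/hudi/common/engine/HoodieEngineContext` line usually points at the Hudi Spark bundle jar missing from the classpath of the Spark job the CLI launches. A hedged shell sketch of launching the CLI bundle with both jars exported; the variable names and paths are assumptions based on the hudi-cli-bundle launcher script, so check the script shipped with your distribution:
   ```
   export SPARK_HOME=/usr/lib/spark                      # assumption: adjust to your environment
   # assumed variable names; the launcher script defines the exact ones it reads
   export CLI_BUNDLE_JAR=/path/to/hudi-cli-bundle_2.12-<version>.jar
   export SPARK_BUNDLE_JAR=/path/to/hudi-spark3-bundle_2.12-<version>.jar
   ./hudi-cli-with-bundle.sh
   ```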


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] [hudi]

2024-04-07 Thread via GitHub


MrAladdin opened a new issue, #10972:
URL: https://github.com/apache/hudi/issues/10972

   **Describe the problem you faced**
   
   Spark Structured Streaming upsert into Hudi (MOR, RECORD_INDEX) is very time consuming:
   1. The number of tasks in each distinct stage of building the workload profile is always 60, and there is severe data skew.
   
   I want to know why it is always 60, how to adjust it, what causes the data skew, and what the optimization options are.
   I have already tried everything I can think of.
   
   
   **Environment Description**
   
   * Hudi version :0.14.1
   
   * Spark version :3.4.1
   
   * Hive version :3.1.2
   
   * Hadoop version :3.1.3
   
   * Storage (HDFS/S3/GCS..) :hdfs
   
   * Running on Docker? (yes/no) :no
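   
   Not a definitive answer to the question above, but the task count of the workload-profile stage is generally tied to the write parallelism and, with RECORD_INDEX, to the record-index layout in the metadata table. A hedged sketch of the knobs one might inspect; the values are illustrative assumptions, not tuning advice:
   ```
   # explicit shuffle parallelism for upsert/insert (0 typically lets the engine decide)
   hoodie.upsert.shuffle.parallelism=200
   hoodie.insert.shuffle.parallelism=200
   # record index file group bounds in the metadata table (affects lookup parallelism)
   hoodie.metadata.record.index.min.filegroup.count=10
   hoodie.metadata.record.index.max.filegroup.count=1000
   ```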
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi deltastreamer fails due to Clean [hudi]

2024-04-01 Thread via GitHub


codope closed issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean
URL: https://github.com/apache/hudi/issues/7209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi deltastreamer fails due to Clean [hudi]

2024-04-01 Thread via GitHub


ad1happy2go commented on issue #7209:
URL: https://github.com/apache/hudi/issues/7209#issuecomment-2029660841

   @koldic Sorry we missed it. You can use multi-writer concurrency control to handle that: https://hudi.apache.org/docs/concurrency_control/#enabling-multi-writing
   
   Closing this issue as it was caused by multiple writers. Thanks. Feel free to open a new one in case of any new issues.
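   
   For anyone landing here later, a minimal sketch of the multi-writing settings that page describes, assuming a ZooKeeper-based lock provider; the ZooKeeper endpoint, lock key, and base path below are placeholders:
   ```
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.url=zk-host          # placeholder
   hoodie.write.lock.zookeeper.port=2181
   hoodie.write.lock.zookeeper.lock_key=my_table    # placeholder
   hoodie.write.lock.zookeeper.base_path=/hudi_locks
   ```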


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-04-01 Thread via GitHub


ROOBALJINDAL closed issue #10884: [SUPPORT] Hudi cdc upserts stopped working 
after migrating from hudi 13.1 to 14.0
URL: https://github.com/apache/hudi/issues/10884


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-04-01 Thread via GitHub


ROOBALJINDAL commented on issue #10884:
URL: https://github.com/apache/hudi/issues/10884#issuecomment-2029401141

   I have found the issue. We were using a custom MssqlDebeziumSource class as the Debezium source, and in its constructor we were using `HoodieStreamerMetrics` instead of `HoodieIngestionMetrics` (which was introduced in Hudi 0.14.0).
   
   Once the class was corrected, it started working. We can close this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi 0.14.0 - deletion from table failing for org.apache.hudi.keygen.TimestampBasedKeyGenerator [hudi]

2024-03-28 Thread via GitHub


xicm commented on issue #10823:
URL: https://github.com/apache/hudi/issues/10823#issuecomment-2024834691

   I think your timestamp.type should be "DATE_STRING".
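   
   For reference, a hedged sketch of what that change could look like for TimestampBasedKeyGenerator; the date patterns below are assumptions and must match the actual format of the partition field in the table:
   ```
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.keygen.timebased.input.dateformat=yyyy-MM-dd
   hoodie.keygen.timebased.output.dateformat=yyyy-MM-dd
   ```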


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-03-19 Thread via GitHub


ad1happy2go commented on issue #10884:
URL: https://github.com/apache/hudi/issues/10884#issuecomment-2007239157

   I don't think it can be a Kafka version related issue, as the job is not failing. We need more logs to debug this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-03-19 Thread via GitHub


ROOBALJINDAL commented on issue #10884:
URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006856626

   @ad1happy2go I need time to set up a new cluster. Our AWS MSK Kafka cluster uses Kafka version 2.6.2; can you confirm whether this is fine or whether it could be an issue? Is there a specific supported version of Kafka?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-03-19 Thread via GitHub


ad1happy2go commented on issue #10884:
URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006696281

   @ROOBALJINDAL Is it possible to try the same on EMR, so that you get all the logs to look into this further? There are no known changes in the 0.14.0 upgrade that would cause this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-03-19 Thread via GitHub


ROOBALJINDAL commented on issue #10884:
URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006449206

   @nsivabalan can you please check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]

2024-03-19 Thread via GitHub


ROOBALJINDAL opened a new issue, #10884:
URL: https://github.com/apache/hudi/issues/10884

   Issue: 
   We have migrated from Hudi 0.13.0 to Hudi 0.14.0, and in this version CDC events from Kafka upserts are not working.
   The table is created the first time, but afterwards any new record added or updated in the SQL table, which pushes a CDC event to Kafka, does not get updated in the Hudi table. Is there any new configuration for Hudi 0.14.0?
   
   We are running AWS EMR Serverless 6.15. We tried to enable debug-level logs by providing the following classification to the serverless app, which modifies the log4j properties to print logs from the Hudi packages, but nothing is printed.
   ```
   {
     "classification": "spark-driver-log4j2",
     "properties": {
       "rootLogger.level": "debug",
       "logger.hudi.level": "debug",
       "logger.hudi.name": "org.apache.hudi"
     }
   },
   {
     "classification": "spark-executor-log4j2",
     "properties": {
       "rootLogger.level": "debug",
       "logger.hudi.level": "debug",
       "logger.hudi.name": "org.apache.hudi"
     }
   }
   ```
   Since it is serverless, we can't SSH into a node to inspect the log4j properties file, so we couldn't get the Hudi logs.
   
   ### **Configurations:**
   
   **### Spark job parameters:**
   ```
   --class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer
   --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED
   --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED
   --conf spark.executor.instances=1
   --conf spark.executor.memory=4g
   --conf spark.driver.memory=4g
   --conf spark.driver.cores=4
   --conf spark.dynamicAllocation.initialExecutors=1
   --props kafka-source.properties
   --config-folder table-config
   --payload-class com.myorg.MssqlDebeziumAvroPayload
   --source-class com.myorg.MssqlDebeziumSource
   --source-ordering-field _event_lsn
   --enable-sync
   --table-type COPY_ON_WRITE
   --source-limit 10
   --op UPSERT
   ```
   
   **### kafka-source.properties:**
   ```
   hoodie.streamer.ingestion.tablesToBeIngested=database1.student
   auto.offset.reset=earliest
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.streamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer
   hoodie.streamer.schemaprovider.registry.url=
   schema.registry.url=http://schema-registry-x:8080/apis/ccompat/v6
   bootstrap.servers=b-1..ikwdtc.c13.us-west-2.amazonaws.com:9096
   hoodie.streamer.schemaprovider.registry.baseUrl=http://schema-registry-x:8080/apis/ccompat/v6/subjects/
   hoodie.parquet.max.file.size=2147483648
   hoodie.parquet.small.file.limit=1073741824
   security.protocol=SASL_SSL
   sasl.mechanism=SCRAM-SHA-512
   sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="" password="x";
   ssl.truststore.location=/usr/lib/jvm/java/jre/lib/security/cacerts
   ssl.truststore.password=changeit
   ```

   **### Table config properties:**
   ```
   hoodie.datasource.hive_sync.database=database1
   hoodie.datasource.hive_sync.support_timestamp=true
   hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
   hoodie.datasource.write.recordkey.field=studentsid
   hoodie.datasource.write.partitionpath.field=studentcreationdate
   hoodie.datasource.hive_sync.table=student
   hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
   hoodie.datasource.hive_sync.partition_fields=studentcreationdate
   hoodie.keygen.timebased.timestamp.type=SCALAR
   hoodie.keygen.timebased.timestamp.scalar.time.unit=DAYS
   hoodie.keygen.timebased.input.dateformat=-MM-dd
   hoodie.keygen.timebased.output.dateformat=-MM-01
   hoodie.keygen.timebased.timezone=GMT+8:00
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.hive_sync.mode=hms
   hoodie.streamer.source.kafka.topic=dev.student
   hoodie.streamer.schemaprovider.registry.urlSuffix=-value/versions/latest
   ```
   
   **Environment Description**
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version : 3.1.3
   
   * Hadoop version :3.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi offline compaction ignores old data [hudi]

2024-03-13 Thread via GitHub


danny0405 commented on issue #10863:
URL: https://github.com/apache/hudi/issues/10863#issuecomment-1996493691

   So you are using offline compaction because online async compaction is disabled: `'compaction.async.enabled' = 'false'`. Did you check the compaction plan to see whether the files it includes are expected?
   
   BTW, you specify a compaction for each commit, so why not use the COW table 
then?
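   
   For reference, one way to inspect a pending compaction plan is through hudi-cli; a sketch, with the table path and instant time as placeholders:
   ```
   connect --path hdfs://namenode/path/to/table
   compactions show all
   compaction show --instant 20240313083000000
   ```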


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi offline compaction ignores old data [hudi]

2024-03-13 Thread via GitHub


ennox108 commented on issue #10863:
URL: https://github.com/apache/hudi/issues/10863#issuecomment-1996335338

   It's streaming ingestion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


