[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex

2022-10-20 Thread GitBox


hudi-bot commented on PR #6680:
URL: https://github.com/apache/hudi/pull/6680#issuecomment-1286540938

   
   ## CI report:
   
   * a65056adaa4e9fabda4205c1b5c7be2e48bdd67f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12382)
 
   * 903ac79702d02fe968d271b08a217f8401d63700 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286531691

   
   ## CI report:
   
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * c3c4376ccec55470031fd3c516722547cd553ee5 UNKNOWN
   * 03ffae125238e63048a248b3cb916b1397c882b5 UNKNOWN
   * 3ad93ee2ab40eb85811b7cd82f592451eb48b432 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12401)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6989: [HUDI-5000] Support schema evolution for Hive/presto

2022-10-20 Thread GitBox


hudi-bot commented on PR #6989:
URL: https://github.com/apache/hudi/pull/6989#issuecomment-1286531579

   
   ## CI report:
   
   * 8ef5a3d07c77bf6f0268db24cbe051af0f6664ad Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12396)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6946: [HUDI-5027] Improve getHBaseConnection Use Constants Replace HardCode.

2022-10-20 Thread GitBox


hudi-bot commented on PR #6946:
URL: https://github.com/apache/hudi/pull/6946#issuecomment-1286531419

   
   ## CI report:
   
   * f564ccd600c10035f96e8196c339108e49e360be Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12371)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xiaoshao commented on issue #6969: [SUPPORT] does hudi do the same in MOR and COW table?

2022-10-20 Thread GitBox


xiaoshao commented on issue #6969:
URL: https://github.com/apache/hudi/issues/6969#issuecomment-1286521661

   > When using Spark, the default behavior is to write new insert records into 
parquet files and update records into the delta log. If you want to write new 
insert records into a log file, you should use the Bucket Index (0.12 and above) or the HBase index.
   
   Got it. I will try to create a Hudi table with the bucket index on Hudi 0.12. 
Thanks for your help.
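   For reference, a minimal sketch of such a write (toy schema; the table name, 
fields, and base path are assumptions for illustration). The relevant knobs are 
`hoodie.index.type=BUCKET` and `hoodie.bucket.index.num.buckets`:
   ```scala
   // Runs in spark-shell with the Hudi 0.12 bundle on the classpath.
   import spark.implicits._

   val df = Seq((1, "a", 1000L, "p1"), (2, "b", 1000L, "p1")).toDF("id", "name", "ts", "par")

   df.write.format("hudi").
     option("hoodie.table.name", "bucket_demo").                      // assumed name
     option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.partitionpath.field", "par").
     option("hoodie.index.type", "BUCKET").                           // bucket index
     option("hoodie.bucket.index.num.buckets", "8").                  // buckets per partition
     mode("append").
     save("file:///tmp/bucket_demo")                                  // assumed path
   ```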





[GitHub] [hudi] xiaoshao commented on issue #6969: [SUPPORT] does hudi do the same in MOR and COW table?

2022-10-20 Thread GitBox


xiaoshao commented on issue #6969:
URL: https://github.com/apache/hudi/issues/6969#issuecomment-1286520086

   @nsivabalan  but the new parquet files include all the records. Is that 
expected?





[GitHub] [hudi] nsivabalan commented on issue #7010: [SUPPORT]org.apache.hudi.utilities.HoodieCleaner cleanup, but spark-submit don't quit

2022-10-20 Thread GitBox


nsivabalan commented on issue #7010:
URL: https://github.com/apache/hudi/issues/7010#issuecomment-1286488077

   Sure, will look into it: 
   https://issues.apache.org/jira/browse/HUDI-5065
   Thanks for reporting. 





[GitHub] [hudi] nsivabalan closed issue #7010: [SUPPORT]org.apache.hudi.utilities.HoodieCleaner cleanup, but spark-submit don't quit

2022-10-20 Thread GitBox


nsivabalan closed issue #7010: [SUPPORT]org.apache.hudi.utilities.HoodieCleaner 
cleanup, but spark-submit don't quit
URL: https://github.com/apache/hudi/issues/7010





[jira] [Assigned] (HUDI-5065) HoodieCleaner does not exit after completion

2022-10-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5065:
-

Assignee: sivabalan narayanan

> HoodieCleaner does not exit after completion
> 
>
> Key: HUDI-5065
> URL: https://issues.apache.org/jira/browse/HUDI-5065
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> When org.apache.hudi.utilities.HoodieCleaner is used for asynchronous cleaning, 
> the spark-submit process does not quit after the cleaning completes and needs 
> Ctrl+C to exit.
>  
> /opt/spark-3.2.2-bin-3.0.0-cdh6.2.1/bin/spark-submit \
>   --class org.apache.hudi.utilities.HoodieCleaner \
>   --jars /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
>   /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
>   --target-base-path /db/dwm_http_scan \
>   --spark-master yarn \
>   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
>   --hoodie-conf hoodie.cleaner.hours.retained=1
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5065) HoodieCleaner does not exit after completion

2022-10-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5065:
--
Fix Version/s: 0.13.0

> HoodieCleaner does not exit after completion
> 
>
> Key: HUDI-5065
> URL: https://issues.apache.org/jira/browse/HUDI-5065
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> When org.apache.hudi.utilities.HoodieCleaner is used for asynchronous cleaning, 
> the spark-submit process does not quit after the cleaning completes and needs 
> Ctrl+C to exit.
>  
> /opt/spark-3.2.2-bin-3.0.0-cdh6.2.1/bin/spark-submit \
>   --class org.apache.hudi.utilities.HoodieCleaner \
>   --jars /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
>   /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
>   --target-base-path /db/dwm_http_scan \
>   --spark-master yarn \
>   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
>   --hoodie-conf hoodie.cleaner.hours.retained=1
>  
>  





[jira] [Created] (HUDI-5065) HoodieCleaner does not exit after completion

2022-10-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-5065:
-

 Summary: HoodieCleaner does not exit after completion
 Key: HUDI-5065
 URL: https://issues.apache.org/jira/browse/HUDI-5065
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


When org.apache.hudi.utilities.HoodieCleaner is used for asynchronous cleaning, 
the spark-submit process does not quit after the cleaning completes and needs 
Ctrl+C to exit.

/opt/spark-3.2.2-bin-3.0.0-cdh6.2.1/bin/spark-submit \
  --class org.apache.hudi.utilities.HoodieCleaner \
  --jars /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
  /opt/hudi-utilities-bundle_2.12-0.12.1.jar \
  --target-base-path /db/dwm_http_scan \
  --spark-master yarn \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
  --hoodie-conf hoodie.cleaner.hours.retained=1
 
 





[GitHub] [hudi] nsivabalan commented on issue #7015: [SUPPORT] - BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :0 org.apache.hudi.exception.HoodieAppendException: Fail

2022-10-20 Thread GitBox


nsivabalan commented on issue #7015:
URL: https://github.com/apache/hudi/issues/7015#issuecomment-1286485759

   I don't have much expertise running HDFS in prod. @n3nash @suryaprasanna: 
can you folks assist here? 
   It looks like a data node is unavailable or there is a connectivity issue, but 
are there any pointers we can give the user? 
   





[GitHub] [hudi] waywtdcc commented on pull request #7009: [HUDI-5058]Fix flink catalog read spark table error : primary key col can not be nullable

2022-10-20 Thread GitBox


waywtdcc commented on PR #7009:
URL: https://github.com/apache/hudi/pull/7009#issuecomment-1286485659

   > 
[5058.patch.zip](https://github.com/apache/hudi/files/9835480/5058.patch.zip) 
Thanks for the contribution, I have reviewed and applied a patch.
   
   Thank you for your review. Do you mean that I should merge this patch into my 
own branch?





[GitHub] [hudi] honeyaya commented on pull request #7007: [HUDI-4809] Glue support drop partitions

2022-10-20 Thread GitBox


honeyaya commented on PR #7007:
URL: https://github.com/apache/hudi/pull/7007#issuecomment-1286484302

   @nsivabalan hi, I tested the function with my AWS Glue partitioned table. 
Any questions? 
   





[GitHub] [hudi] nsivabalan commented on issue #7013: [SUPPORT] java.lang.NoSuchMethodError: org.apache.spark.serializer.KryoSerializer.newKryo()Lorg/apache/hudi/com/esotericsoftware/kryo/Kryo;

2022-10-20 Thread GitBox


nsivabalan commented on issue #7013:
URL: https://github.com/apache/hudi/issues/7013#issuecomment-1286484131

   @xushiyan: you are fixing some Kryo-related bundling issue, right? Is this 
related? 





[GitHub] [hudi] nsivabalan commented on issue #6964: [SUPPORT] Error in querying hudi table stored in orc format with hive,Index is not populated for 10 (state=,code=0)

2022-10-20 Thread GitBox


nsivabalan commented on issue #6964:
URL: https://github.com/apache/hudi/issues/6964#issuecomment-1286482464

   https://issues.apache.org/jira/browse/HUDI-4496
   





[GitHub] [hudi] nsivabalan commented on issue #6964: [SUPPORT] Error in querying hudi table stored in orc format with hive,Index is not populated for 10 (state=,code=0)

2022-10-20 Thread GitBox


nsivabalan commented on issue #6964:
URL: https://github.com/apache/hudi/issues/6964#issuecomment-1286482266

   I don't think our ORC support works with Spark 3.x. We are looking to fix it, 
but as of now it has some gaps. 





[GitHub] [hudi] hudi-bot commented on pull request #6796: [HUDI-4741] hotfix to avoid partial failover cause restored subtask f…

2022-10-20 Thread GitBox


hudi-bot commented on PR #6796:
URL: https://github.com/apache/hudi/pull/6796#issuecomment-1286481122

   
   ## CI report:
   
   * 65fd7f9ce6e2c3448433c683c034694df3b12c85 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12395)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on issue #6966: [SUPPORT]HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id308723 partitionPath=202210141643}, currentLocation='null

2022-10-20 Thread GitBox


nsivabalan commented on issue #6966:
URL: https://github.com/apache/hudi/issues/6966#issuecomment-1286481012

   I will let @fengjian428 follow up, but I am curious why you are setting the 
max delta commits config value to 0. You might as well switch to using COW; with 
Spark structured streaming into a MOR table, very aggressive compaction is not 
recommended. 
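   For context, a minimal sketch of a less aggressive setup, where inline compaction 
runs only every few delta commits (table name, fields, and path are assumptions):
   ```scala
   // Illustrative only: compaction triggers after 5 delta commits instead of
   // compacting on every commit.
   import spark.implicits._

   val df = Seq((1, "a", 1000L, "p1")).toDF("id", "name", "ts", "par")

   df.write.format("hudi").
     option("hoodie.table.name", "mor_demo").
     option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.partitionpath.field", "par").
     option("hoodie.compact.inline", "true").
     option("hoodie.compact.inline.max.delta.commits", "5").
     mode("append").
     save("file:///tmp/mor_demo")
   ```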





[GitHub] [hudi] nsivabalan commented on issue #6967: [SUPPORT]prestodb read hudi incorretly after ddl by spark

2022-10-20 Thread GitBox


nsivabalan commented on issue #6967:
URL: https://github.com/apache/hudi/issues/6967#issuecomment-1286467305

   Thanks @xiarixiaoyao.
   @duanyongvictory: if you don't have any more questions, feel free to close 
out the issue.





[GitHub] [hudi] nsivabalan commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

2022-10-20 Thread GitBox


nsivabalan commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1286466166

   Can you try setting a partitioner via 
   `--output-partitioner`? 
   This should help improve performance. 
   
   Also, are you setting any value for `--output-partition-field`? 





[GitHub] [hudi] hudi-bot commented on pull request #7001: [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception

2022-10-20 Thread GitBox


hudi-bot commented on PR #7001:
URL: https://github.com/apache/hudi/pull/7001#issuecomment-1286443650

   
   ## CI report:
   
   * e1200da31c768545aa302ebad0d842d5e9635acb Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12337)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12339)
 
   * b82500dac4731258c9043922f7c3a0f89fecef8a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12412)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7001: [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception

2022-10-20 Thread GitBox


hudi-bot commented on PR #7001:
URL: https://github.com/apache/hudi/pull/7001#issuecomment-1286440374

   
   ## CI report:
   
   * e1200da31c768545aa302ebad0d842d5e9635acb Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12337)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12339)
 
   * b82500dac4731258c9043922f7c3a0f89fecef8a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6991: [HUDI-5049] HoodieCatalog supports the implementation of dropPartition

2022-10-20 Thread GitBox


hudi-bot commented on PR #6991:
URL: https://github.com/apache/hudi/pull/6991#issuecomment-1286436827

   
   ## CI report:
   
   * fbb4c300133ad1c677826b836652152032b4c616 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12392)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on issue #6984: [SUPPORT] hudi metrics with flink so little

2022-10-20 Thread GitBox


nsivabalan commented on issue #6984:
URL: https://github.com/apache/hudi/issues/6984#issuecomment-1286433720

   We do expose a lot of metrics; I am not sure what exactly you are looking for. 
For instance, you should see the below set of metrics 
   for regular commits:
   ```
   Metrics.registerGauge(getMetricsName(actionType, "totalPartitionsWritten"), totalPartitionsWritten);
   Metrics.registerGauge(getMetricsName(actionType, "totalFilesInsert"), totalFilesInsert);
   Metrics.registerGauge(getMetricsName(actionType, "totalFilesUpdate"), totalFilesUpdate);
   Metrics.registerGauge(getMetricsName(actionType, "totalRecordsWritten"), totalRecordsWritten);
   Metrics.registerGauge(getMetricsName(actionType, "totalUpdateRecordsWritten"), totalUpdateRecordsWritten);
   Metrics.registerGauge(getMetricsName(actionType, "totalInsertRecordsWritten"), totalInsertRecordsWritten);
   Metrics.registerGauge(getMetricsName(actionType, "totalBytesWritten"), totalBytesWritten);
   Metrics.registerGauge(getMetricsName(actionType, "totalScanTime"), totalTimeTakenByScanner);
   Metrics.registerGauge(getMetricsName(actionType, "totalCreateTime"), totalTimeTakenForInsert);
   Metrics.registerGauge(getMetricsName(actionType, "totalUpsertTime"), totalTimeTakenForUpsert);
   Metrics.registerGauge(getMetricsName(actionType, "totalCompactedRecordsUpdated"), totalCompactedRecordsUpdated);
   Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesCompacted"), totalLogFilesCompacted);
   Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesSize"), totalLogFilesSize);
   ```
   There are also commitTime, duration, and commitLatencyInMs; these are just not 
categorized by operation type. 
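   As a pointer, a minimal sketch of turning the metrics subsystem on for a write 
so these gauges get reported (Graphite shown; other reporter types exist; the table 
name, fields, and paths are assumptions):
   ```scala
   // Runs in spark-shell with the Hudi bundle; names/paths are illustrative.
   import spark.implicits._

   val df = Seq((1, "a", 1000L, "p1")).toDF("id", "name", "ts", "par")

   df.write.format("hudi").
     option("hoodie.table.name", "metrics_demo").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.partitionpath.field", "par").
     option("hoodie.metrics.on", "true").                 // master switch
     option("hoodie.metrics.reporter.type", "GRAPHITE").  // or JMX, DATADOG, PROMETHEUS, ...
     option("hoodie.metrics.graphite.host", "localhost").
     option("hoodie.metrics.graphite.port", "4756").
     mode("append").
     save("file:///tmp/metrics_demo")
   ```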
   
   @bhasudha: we should add a new page listing all the metrics exposed by 
hudi. 
   
   
   





[GitHub] [hudi] xushiyan commented on a diff in pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


xushiyan commented on code in PR #7005:
URL: https://github.com/apache/hudi/pull/7005#discussion_r1001331925


##
packaging/bundle-validation/utilities/newSchema.avsc:
##
@@ -0,0 +1,54 @@
+{
+"type" : "record",
+"name" : "test_struct",

Review Comment:
   The schema is here: docker/demo/config/schema.avsc



##
packaging/bundle-validation/ci_run.sh:
##
@@ -59,7 +54,34 @@ elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then
   IMAGE_TAG=spark330hive313
 fi
 
-cd packaging/bundle-validation/spark-write-hive-sync || exit 1
+# Copy bundle jars
+BUNDLE_VALIDATION_DIR=${GITHUB_WORKSPACE}/bundle-validation
+mkdir $BUNDLE_VALIDATION_DIR
+JARS_DIR=${BUNDLE_VALIDATION_DIR}/jars
+mkdir $JARS_DIR
+cp 
${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-${SPARK_PROFILE}-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar
 $JARS_DIR/spark.jar
+cp 
${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar
 $JARS_DIR/utilities.jar
+cp 
${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_${SCALA_PROFILE#'scala-'}-$HUDI_VERSION.jar
 $JARS_DIR/utilities-slim.jar

Review Comment:
   This pattern `packaging//target/hudi-*-$HUDI_VERSION.jar` 
should guarantee that the intended jar is copied, as long as we do `mvn clean package`, 
which cleans the jars from the last build. That way we can avoid some interpolation with 
env vars, which is kind of error-prone.



##
packaging/bundle-validation/validate.sh:
##
@@ -0,0 +1,118 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# NOTE: this script runs inside hudi-ci-bundle-validation container
+# $WORKDIR/jars/ is supposed to be mounted to a host directory where bundle 
jars are placed
+# TODO: $JAR_COMBINATIONS should have different orders for different jars to 
detect class loading issues
+
+WORKDIR=/opt/bundle-validation
+HIVE_DATA=${WORKDIR}/data/hive
+JAR_DATA=${WORKDIR}/data/jars
+UTILITIES_DATA=${WORKDIR}/data/utilities
+
+test_spark_bundle () {
+echo "::warning::validate.sh setting up hive sync"
+# put config files in correct place
+cp $HIVE_DATA/spark-defaults.conf $SPARK_HOME/conf/
+cp $HIVE_DATA/hive-site.xml $HIVE_HOME/conf/
+ln -sf $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml
+cp $DERBY_HOME/lib/derbyclient.jar $SPARK_HOME/jars/
+
+$DERBY_HOME/bin/startNetworkServer -h 0.0.0.0 &
+$HIVE_HOME/bin/hiveserver2 &
+echo "::warning::validate.sh hive setup complete. Testing"
+$SPARK_HOME/bin/spark-shell --jars $JAR_DATA/spark.jar < 
$HIVE_DATA/validate.scala
+if [ "$?" -ne 0 ]; then
+echo "::error::validate.sh failed hive testing"
+exit 1
+fi
+echo "::warning::validate.sh hive testing successful. Cleaning up hive 
sync"
+# remove config files
+rm -f $SPARK_HOME/jars/derbyclient.jar
+unlink $SPARK_HOME/conf/hive-site.xml
+rm -f $HIVE_HOME/conf/hive-site.xml
+rm -f $SPARK_HOME/conf/spark-defaults.conf
+}
+
+test_utilities_bundle () {
+OPT_JARS=""
+if [[ -n $ADDITIONAL_JARS ]]; then
+OPT_JARS="--jars $ADDITIONAL_JARS"
+fi
+echo "::warning::validate.sh running deltastreamer"
+$SPARK_HOME/bin/spark-submit --driver-memory 8g --executor-memory 8g \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+$OPT_JARS $MAIN_JAR \
+--props $UTILITIES_DATA/newProps.props \
+--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--source-class org.apache.hudi.utilities.sources.JsonDFSSource \
+--source-ordering-field ts --table-type MERGE_ON_READ \
+--target-base-path file://${OUTPUT_DIR} \
+--target-table utilities_tbl  --op UPSERT
+echo "::warning::validate.sh done with deltastreamer"
+
+OUTPUT_SIZE=$(du -s ${OUTPUT_DIR} | awk '{print $1}')
+if [[ -z $OUTPUT_SIZE || "$OUTPUT_SIZE" -lt "550" ]]; then
+echo "::error::validate.sh deltastreamer output folder ($OUTPUT_SIZE) 
is smaller than expected (550)" 
+exit 1
+fi
+
+echo "::warning::validate.sh validating deltastreamer in spark shell"
+SHELL_COMMAND="$SPARK_HOME/bin/spark-shell --jars $ADDITIONAL_JARS 

[GitHub] [hudi] nsivabalan commented on issue #7002: [SUPPORT] Lost data in structured streaming

2022-10-20 Thread GitBox


nsivabalan commented on issue #7002:
URL: https://github.com/apache/hudi/issues/7002#issuecomment-1286430669

   
https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritestreamingignorefailedbatch
   You may need to set this config to false. 
   We made the default value false in 0.12.1, but up to and including 0.12.0 the 
default value is true, so there are chances of data loss. 
   
   Let us know if you still see data loss; we can definitely dig deeper. 
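   For reference, a minimal sketch of wiring that config into a structured 
streaming write (toy rate source; the table name, fields, and paths are assumptions):
   ```scala
   // With ignore.failed.batch=false, a failed micro-batch fails the query
   // instead of being silently skipped.
   val streamDf = spark.readStream.format("rate").load()

   streamDf.selectExpr("value as id", "timestamp as ts").writeStream.
     format("hudi").
     option("hoodie.table.name", "stream_demo").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.partitionpath.field", "").
     option("hoodie.datasource.write.keygenerator.class",
       "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
     option("hoodie.datasource.write.streaming.ignore.failed.batch", "false").
     option("checkpointLocation", "file:///tmp/checkpoints/stream_demo").
     outputMode("append").
     start("file:///tmp/stream_demo")
   ```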
   
   





[GitHub] [hudi] nsivabalan commented on issue #7004: [SUPPORT] Better packaging for hudi-cli with hudi-cli-bundle.jar

2022-10-20 Thread GitBox


nsivabalan commented on issue #7004:
URL: https://github.com/apache/hudi/issues/7004#issuecomment-1286429041

   Thanks for the ask. We will see how we can prioritize this. 





[GitHub] [hudi] nsivabalan closed issue #7004: [SUPPORT] Better packaging for hudi-cli with hudi-cli-bundle.jar

2022-10-20 Thread GitBox


nsivabalan closed issue #7004: [SUPPORT] Better packaging for hudi-cli with 
hudi-cli-bundle.jar
URL: https://github.com/apache/hudi/issues/7004





[GitHub] [hudi] nsivabalan commented on issue #7004: [SUPPORT] Better packaging for hudi-cli with hudi-cli-bundle.jar

2022-10-20 Thread GitBox


nsivabalan commented on issue #7004:
URL: https://github.com/apache/hudi/issues/7004#issuecomment-1286428721

   Yes, we have it in our plans. 
   https://issues.apache.org/jira/browse/HUDI-4666
   CC @rahil-c 





[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001340777


##
rfc/rfc-52/rfc-52.md:
##
@@ -0,0 +1,284 @@
+
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+- [KV Mapping](#impl-index-layer-kv-mapping)
+- [Build Index](#impl-index-layer-build-index)
+- [Read Index](#impl-index-layer-read-index)
+- [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## Abstract
+In query processing, we need to scan many data blocks of a HUDI table. However, most of them may not
+match the query predicate even after using column statistics in the metadata table, or row group level and
+page level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, so saving IO has become
+the key to improving query performance.
+
+## Background
+Much work has been carried out to optimize reading HUDI table parquet files.
+
+Since Spark 3.2.0, with the power of the parquet column index, page level statistics can be used
+to filter data, and the process of reading data can be described as follows (Process A):
+- Step 1: Compare the row group data's middle position with the task split info to decide which
+  row groups should be handled by the current task. If the row group data's middle position is
+  contained by the task split, the row group is handled by this task
+- Step 2: Use the pushed-down predicates and row group level column statistics to pick out the matched
+  row groups
+- Step 3: Filter pages by page level statistics for each column predicate, then get the matched row id set
+  for every column independently
+- Step 4: Get the final matched row id ranges by combining the matched rows of all columns, then get the
+  final matched pages for every column
+- Step 5: Load and uncompress the matched pages for every requested column
+- Step 6: Read the data by the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## Insufficiency
+Although page level statistics can greatly save IO cost, some irrelevant data is still read out.
+
+We need a way to get exactly the row data we need, to minimize the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about, to
+speed up query performance.
+
+## Architecture
+The main structure of the secondary index contains 5 layers
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/..., for managing secondary indexes
+2. Optimizer layer: Pick the best physical/logical plan for a query using RBO/CBO/HBO etc
+3. Standard API interface layer: provides standard interfaces for the upper layers to invoke, such as ``createIndex``, 
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from, 
+   such as HBase based, Lucene based, B+ tree based, etc
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## Differences between Secondary Index and HUDI Record Level Index
+Before discussing the secondary index, let's take a look at the Record Level Index. Although both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, the record level index in hudi 
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate, it can only filter at the file group level,
+while the secondary index can provide the exact matched set of rows.
+
+For more details about the current implementation of the record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## Implementation
+
+### SQL Layer
+Parsing all kinds of index related SQL (Spark/Flink, etc.), including 
create/drop/alter i

[GitHub] [hudi] nsivabalan commented on issue #6900: [SUPPORT]Hudi Failed to read MARKERS file

2022-10-20 Thread GitBox


nsivabalan commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-1286427252

   We might need to inspect the timeline to see what's happening; maybe the 
metadata table is corrupt. We might need to inspect that. 
   
   Can you run our validation tool against your table and let us know what you 
see? 
   
   
https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
   
   Enable all of these: 
   - `--validate-latest-file-slices`: validate latest file slices for all partitions.
   - `--validate-latest-base-files`: validate latest base files for all partitions.
   - `--validate-all-file-groups`: validate all file groups, and all file slices within file groups.
   
   
   





[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001338212


##
rfc/rfc-52/rfc-52.md:
##

[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001336571


##
rfc/rfc-52/rfc-52.md:
##

[GitHub] [hudi] liufangqi commented on a diff in pull request #7001: [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception

2022-10-20 Thread GitBox


liufangqi commented on code in PR #7001:
URL: https://github.com/apache/hudi/pull/7001#discussion_r1001330427


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bucket/BucketBulkInsertWriterHelper.java:
##
@@ -70,7 +70,7 @@ public void write(RowData tuple) throws IOException {
   handle.write(recordKey, partitionPath, record);
 } catch (Throwable throwable) {
   LOG.error("Global error thrown while trying to write records in 
HoodieRowDataCreateHandle", throwable);
-  throw throwable;
+  throw new IOException(throwable);

Review Comment:
   > can you add the same msg as with L72 within the new IOException as well.
   
   @nsivabalan Thanks for your review, that's a good catch. I updated my code in 
the new commit.






[GitHub] [hudi] yhyyz commented on issue #6900: [SUPPORT]Hudi Failed to read MARKERS file

2022-10-20 Thread GitBox


yhyyz commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-1286415788

   @nsivabalan  Thanks for your help. 
   1. Using structured streaming multiple stream queries instead of 
`forEachBatch`, with the following properties, the application has been running 
for 19 hours without any errors. But if `hoodie.embed.timeline.server=true` is 
set, the error `UpsertPartitioner: Error trying to compute average 
bytes/record,... Caused by: java.io.FileNotFoundException: No such file or 
directory /.hoodie/commit` occurred.
   ```
   hoodie.datasource.hive_sync.enable=false
   hoodie.upsert.shuffle.parallelism=20
   hoodie.insert.shuffle.parallelism=20
   hoodie.keep.min.commits=6
   hoodie.keep.max.commits=7
   hoodie.parquet.small.file.limit=52428800
   hoodie.index.type=GLOBAL_BLOOM
   hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.metadata.enable=true
   hoodie.cleaner.commits.retained=3
   hoodie.clean.async=false
   hoodie.clean.automatic=true
   hoodie.archive.async=false
   hoodie.datasource.compaction.async.enable=true
   hoodie.write.markers.type=DIRECT
   hoodie.embed.timeline.server=false
   hoodie.embed.timeline.server.async=false
   ```
   2. Using `forEachBatch` with multiple threads, with inline compaction enabled 
instead of offline compaction and the following properties, the error 
`UpsertPartitioner: Error trying to compute average bytes/record,... Caused by: 
java.io.FileNotFoundException: No such file or directory 
/.hoodie/commit` occurred, but the application still runs. I will set 
`hoodie.embed.timeline.server=false` to test again; any new information I will 
sync here.
   ```
   hoodie.datasource.hive_sync.enable=false
   hoodie.upsert.shuffle.parallelism=20
   hoodie.insert.shuffle.parallelism=20
   hoodie.keep.min.commits=6
   hoodie.keep.max.commits=7
   hoodie.parquet.small.file.limit=52428800
   hoodie.index.type=GLOBAL_BLOOM
   hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.metadata.enable=true
   hoodie.cleaner.commits.retained=3
   hoodie.clean.max.commits=5
   hoodie.clean.async=false
   hoodie.clean.automatic=true
   hoodie.archive.async=false
   hoodie.compact.inline=true
   hoodie.datasource.compaction.async.enable=false
   hoodie.write.markers.type=DIRECT
   hoodie.embed.timeline.server=true
   hoodie.embed.timeline.server.async=false
   hoodie.compact.schedule.inline=false
   hoodie.compact.inline.max.delta.commits=2
   ```





[GitHub] [hudi] danny0405 commented on pull request #7009: [HUDI-5058]Fix read spark table error : primary key col can not be nullable

2022-10-20 Thread GitBox


danny0405 commented on PR #7009:
URL: https://github.com/apache/hudi/pull/7009#issuecomment-1286415464

   [5058.patch.zip](https://github.com/apache/hudi/files/9835480/5058.patch.zip)
   Thanks for the contribution, I have reviewed and applied a patch.





[GitHub] [hudi] liufangqi commented on a diff in pull request #7001: [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception

2022-10-20 Thread GitBox


liufangqi commented on code in PR #7001:
URL: https://github.com/apache/hudi/pull/7001#discussion_r1001330427


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bucket/BucketBulkInsertWriterHelper.java:
##

Review Comment:
   > can you add the same msg as with L72 within the new IOException as well.
   
   @nsivabalan Thanks for your review, that's a good catch. I updated my code as 
advised in the new commit.






[GitHub] [hudi] nsivabalan commented on pull request #7007: [HUDI-4809] Glue support drop partitions

2022-10-20 Thread GitBox


nsivabalan commented on PR #7007:
URL: https://github.com/apache/hudi/pull/7007#issuecomment-1286409725

   Have you tested the patch? 





[GitHub] [hudi] huberylee commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


huberylee commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001325640


##
rfc/rfc-52/rfc-52.md:
##
@@ -0,0 +1,284 @@
+
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level 
Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+- [KV Mapping](#impl-index-layer-kv-mapping)
+- [Build Index](#impl-index-layer-build-index)
+- [Read Index](#impl-index-layer-read-index)
+- [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## Abstract
+In query processing, we need to scan many data blocks in HUDI table. However, 
most of them may not
+match the query predicate even after using column statistic info in the 
metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of touched blocks determines the query speed, and how to 
save IO has become
+the key point to improving query performance.
+
+## Background
+Much work has been done to optimize reading HUDI table parquet files.
+
+Since Spark 3.2.0, with the power of the parquet column index, page level statistics can be used
+to filter data, and the process of reading data can be described as follows (Process A):
+- Step 1: Compare the middle position of each row group with the task split to decide which row
+   groups should be handled by the current task; if a row group's middle position falls within
+   the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row group level column statistics to pick out matching
+   row groups
+- Step 3: Filter pages by page level statistics for each column predicate, then get the matching
+   row id set for every column independently
+- Step 4: Get the final matching row id ranges by combining the matching rows of all columns,
+   then get the final matching pages for every column
+- Step 5: Load and decompress the matching pages for every requested column
+- Step 6: Read data by the matching row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## Insufficiency
+Although page level statistics can greatly reduce IO cost, some irrelevant data is still read.
+
+We need a way to fetch exactly the row data we need, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about, to
+speed up query performance.
+
+## Architecture
+The secondary index consists of the following layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/..., for managing secondary indexes
+2. Optimizer layer: picks the best physical/logical plan for a query using RBO/CBO/HBO, etc.
+3. Standard API interface layer: provides standard interfaces for upper layers to invoke, such as ``createIndex``,
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from,
+   such as HBase based, Lucene based, B+ tree based, etc.
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## Differences between Secondary Index and HUDI Record Level Index
+Before discussing the secondary index, let's take a look at the Record Level Index. While both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, the record level index in Hudi
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, but not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate,
+it can only filter at the file group level, while the secondary index can provide the exact matching set of rows.
+
+For more details about the current implementation of the record level index, please refer to

Review Comment:
   Read or write path here means that the record level index and the secondary index are 
used in different scenarios. The record level index is mainly used with ``tagLocation`` when 

[GitHub] [hudi] nsivabalan commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


nsivabalan commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286408416

   if it's ready for review, please add the jira link. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7001: [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception

2022-10-20 Thread GitBox


nsivabalan commented on code in PR #7001:
URL: https://github.com/apache/hudi/pull/7001#discussion_r1001322701


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bucket/BucketBulkInsertWriterHelper.java:
##
@@ -70,7 +70,7 @@ public void write(RowData tuple) throws IOException {
   handle.write(recordKey, partitionPath, record);
 } catch (Throwable throwable) {
   LOG.error("Global error thrown while trying to write records in 
HoodieRowDataCreateHandle", throwable);
-  throw throwable;
+  throw new IOException(throwable);

Review Comment:
   can you add the same msg as on L72 within the new IOException as well?
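   
   For reference, a minimal sketch of the suggested change (the message string is copied 
from the `LOG.error` call in the diff above; illustrative only, not the final patch):
   
   ```java
   } catch (Throwable throwable) {
     LOG.error("Global error thrown while trying to write records in HoodieRowDataCreateHandle", throwable);
     // Reuse the same message so the rethrown IOException keeps the logged context.
     throw new IOException("Global error thrown while trying to write records in HoodieRowDataCreateHandle", throwable);
   }
   ```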



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan closed pull request #6997: [HUDI-6000] testing out spark3.2 tests for utilities

2022-10-20 Thread GitBox


nsivabalan closed pull request #6997: [HUDI-6000] testing out spark3.2 tests 
for utilities
URL: https://github.com/apache/hudi/pull/6997


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7012: [minor]add commit_action output in show_commits

2022-10-20 Thread GitBox


hudi-bot commented on PR #7012:
URL: https://github.com/apache/hudi/pull/7012#issuecomment-1286401718

   
   ## CI report:
   
   * 8ea6016beff64e73570d434931813bb8cee911d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12385)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12411)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] scxwhite commented on pull request #7012: [minor]add commit_action output in show_commits

2022-10-20 Thread GitBox


scxwhite commented on PR #7012:
URL: https://github.com/apache/hudi/pull/7012#issuecomment-1286399506

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7016: [MINOR] fix cdc flake ut

2022-10-20 Thread GitBox


hudi-bot commented on PR #7016:
URL: https://github.com/apache/hudi/pull/7016#issuecomment-1286398575

   
   ## CI report:
   
   * 2d3cc75bfcf73bf62ff20ee8eb960754b4e56931 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12410)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7014: [HUDI-4971] Remove direct use of kryo from `SerDeUtils`

2022-10-20 Thread GitBox


hudi-bot commented on PR #7014:
URL: https://github.com/apache/hudi/pull/7014#issuecomment-1286398547

   
   ## CI report:
   
   * 1a66c178d59a5fef0064589d4a44eba5b6eb7137 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12400)
 
   * c1f774b8ef781cc64b3ff96b82afeec27f4b5265 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12409)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7016: [MINOR] fix cdc flake ut

2022-10-20 Thread GitBox


hudi-bot commented on PR #7016:
URL: https://github.com/apache/hudi/pull/7016#issuecomment-1286395122

   
   ## CI report:
   
   * 2d3cc75bfcf73bf62ff20ee8eb960754b4e56931 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7014: [HUDI-4971] Remove direct use of kryo from `SerDeUtils`

2022-10-20 Thread GitBox


hudi-bot commented on PR #7014:
URL: https://github.com/apache/hudi/pull/7014#issuecomment-1286395101

   
   ## CI report:
   
   * 1316c4a23620dc7ae4df8530a345548d14d595e8 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12398)
 
   * 1a66c178d59a5fef0064589d4a44eba5b6eb7137 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12400)
 
   * c1f774b8ef781cc64b3ff96b82afeec27f4b5265 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6643: [HUDI-4823]Add read_optimize spark_session config to use in spark-sql

2022-10-20 Thread GitBox


hudi-bot commented on PR #6643:
URL: https://github.com/apache/hudi/pull/6643#issuecomment-1286394561

   
   ## CI report:
   
   * 180e976b0e06e21469fcf3e7f946e451c45744a8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12389)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on pull request #6991: [HUDI-5049] HoodieCatalog supports the implementation of dropPartition

2022-10-20 Thread GitBox


SteNicholas commented on PR #6991:
URL: https://github.com/apache/hudi/pull/6991#issuecomment-1286383820

   @danny0405, I have already applied the above patch and fixed the 
`HoodieHiveCatalog`. PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5046) Support all the hive sync options for flink sql

2022-10-20 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-5046:
-
Fix Version/s: 0.13.0

> Support all the hive sync options for flink sql
> ---
>
> Key: HUDI-5046
> URL: https://issues.apache.org/jira/browse/HUDI-5046
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-5046) Support all the hive sync options for flink sql

2022-10-20 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621446#comment-17621446
 ] 

Danny Chen commented on HUDI-5046:
--

Fixed via master branch: 3452876f855c84e3adba34f125e1f58677c66ce8

> Support all the hive sync options for flink sql
> ---
>
> Key: HUDI-5046
> URL: https://issues.apache.org/jira/browse/HUDI-5046
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on pull request #6976: [HUDI-5042]fix clustering schedule problem in flink

2022-10-20 Thread GitBox


danny0405 commented on PR #6976:
URL: https://github.com/apache/hudi/pull/6976#issuecomment-1286371651

   There are many test failures, can you please check it again ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-5046) Support all the hive sync options for flink sql

2022-10-20 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-5046.
--

> Support all the hive sync options for flink sql
> ---
>
> Key: HUDI-5046
> URL: https://issues.apache.org/jira/browse/HUDI-5046
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (10f000285a -> 3452876f85)

2022-10-20 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 10f000285a [HUDI-4960] Upgrade jetty version for timeline server 
(#6844)
 add 3452876f85 [HUDI-5046] Support all the hive sync options for flink sql 
(#6985)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/configuration/FlinkOptions.java |  3 ++-
 .../org/apache/hudi/sink/utils/HiveSyncContext.java |  3 ++-
 .../apache/hudi/sink/utils/TestHiveSyncContext.java | 21 ++---
 3 files changed, 18 insertions(+), 9 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #6985: [HUDI-5046] Support all the hive sync options for flink sql

2022-10-20 Thread GitBox


danny0405 merged PR #6985:
URL: https://github.com/apache/hudi/pull/6985


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6985: [HUDI-5046] Support all the hive sync options for flink sql

2022-10-20 Thread GitBox


danny0405 commented on PR #6985:
URL: https://github.com/apache/hudi/pull/6985#issuecomment-1286370909

   
   
![image](https://user-images.githubusercontent.com/7644508/197095884-879673f1-7f53-4c4b-9b3d-334a158dcdba.png)
   
   The modified test passed and the failed tests are not related; will merge it soon ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] huberylee commented on pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


huberylee commented on PR #5370:
URL: https://github.com/apache/hudi/pull/5370#issuecomment-1286370430

   > Great start. Overall I also think we need to think about the abstraction 
API more carefully here.
   
   The current implementation provides an abstract framework upon which we can 
easily extend other types of secondary indexes. This document is a little out 
of date; I will update it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] huberylee commented on pull request #6712: [HUDI-4770][Stacked on 4294] Adapt spark query engine to use secondary index when querying data

2022-10-20 Thread GitBox


huberylee commented on PR #6712:
URL: https://github.com/apache/hudi/pull/6712#issuecomment-1286366948

   > hey @huberylee : Is there an RFC around this.
   
   Yes, this PR is part of the secondary index implementation. RFC is in 
another PR: https://github.com/apache/hudi/pull/5370


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] huberylee commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-10-20 Thread GitBox


huberylee commented on PR #6677:
URL: https://github.com/apache/hudi/pull/6677#issuecomment-1286366767

   > @huberylee : seems interesting idea. but do we have an RFC on this already 
?
   
   Yes, this PR is part of the secondary index implementation. RFC is in 
another PR: https://github.com/apache/hudi/pull/5370


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron opened a new pull request, #7016: [MINOR] fix cdc flake ut

2022-10-20 Thread GitBox


YannByron opened a new pull request, #7016:
URL: https://github.com/apache/hudi/pull/7016

   ### Change Logs
   
   fix flaky CDC UT
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6986: [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true

2022-10-20 Thread GitBox


hudi-bot commented on PR #6986:
URL: https://github.com/apache/hudi/pull/6986#issuecomment-1286357763

   
   ## CI report:
   
   * 088cfee9a091af9f0327d4b432fcb3aa9a6e22ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12383)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12407)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.

2022-10-20 Thread GitBox


hudi-bot commented on PR #6705:
URL: https://github.com/apache/hudi/pull/6705#issuecomment-1286357440

   
   ## CI report:
   
   * 666b6690b531d1540b196f0f97e175c7da38166d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12379)
 
   * 212fbe84247891d11bb0e1a12fe454b34a270c75 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12406)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xicm commented on pull request #6986: [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true

2022-10-20 Thread GitBox


xicm commented on PR #6986:
URL: https://github.com/apache/hudi/pull/6986#issuecomment-1286354627

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.

2022-10-20 Thread GitBox


hudi-bot commented on PR #6705:
URL: https://github.com/apache/hudi/pull/6705#issuecomment-1286354207

   
   ## CI report:
   
   * 666b6690b531d1540b196f0f97e175c7da38166d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12379)
 
   * 212fbe84247891d11bb0e1a12fe454b34a270c75 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6632: [HUDI-4753] more accurate record size estimation for log writing and spillable map

2022-10-20 Thread GitBox


hudi-bot commented on PR #6632:
URL: https://github.com/apache/hudi/pull/6632#issuecomment-1286354124

   
   ## CI report:
   
   * d9e12ddf962b670b8ec1e2260d5389c688e16001 UNKNOWN
   * ba3513d5b65e39f7cbb71e851ddd34cfe9d846a0 UNKNOWN
   * d566707b1aeb573b26e109e473851bcc4918daef Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12372)
 
   * 4365fc6b980d7d54643e1fcc7b7206342aa8b8e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12404)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6448: [HUDI-4647] Keep the hive sync settings in spark sql consistent

2022-10-20 Thread GitBox


hudi-bot commented on PR #6448:
URL: https://github.com/apache/hudi/pull/6448#issuecomment-1286354007

   
   ## CI report:
   
   * 49cacedc9fab4cf37c1b93416b95bf9dd503a60c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12387)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more

2022-10-20 Thread GitBox


hudi-bot commented on PR #6227:
URL: https://github.com/apache/hudi/pull/6227#issuecomment-1286353847

   
   ## CI report:
   
   * 4a6c6a488d8ff2307833653d5f45df7ff0247ad3 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12402)
 
   * 454cd1dc4e04aadfb8836553a45e201dc9abb8dc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12405)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6989: [HUDI-5000] Support schema evolution for Hive/presto

2022-10-20 Thread GitBox


xiarixiaoyao commented on code in PR #6989:
URL: https://github.com/apache/hudi/pull/6989#discussion_r1001276202


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java:
##
@@ -76,6 +76,9 @@ public RecordReader getRecordReader(final InputSplit
   return createBootstrappingRecordReader(split, job, reporter);
 }
 
+// adapt schema evolution
+new SchemaEvolutionContext(split, job).doEvolutionForParquetFormat();

Review Comment:
   Do we need to fall back if schema evolution fails?
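   
   A hedged sketch of what such a fallback could look like (assuming a `LOG` field exists 
in the class and that skipping evolution still leaves a readable, non-evolved schema):
   
   ```java
   // Hypothetical fallback: if schema evolution setup fails, log a warning and
   // continue with the plain parquet read path instead of failing the query.
   try {
     new SchemaEvolutionContext(split, job).doEvolutionForParquetFormat();
   } catch (Exception e) {
     LOG.warn("Schema evolution for parquet format failed, falling back to plain read", e);
   }
   ```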



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6632: [HUDI-4753] more accurate record size estimation for log writing and spillable map

2022-10-20 Thread GitBox


hudi-bot commented on PR #6632:
URL: https://github.com/apache/hudi/pull/6632#issuecomment-1286350941

   
   ## CI report:
   
   * d9e12ddf962b670b8ec1e2260d5389c688e16001 UNKNOWN
   * ba3513d5b65e39f7cbb71e851ddd34cfe9d846a0 UNKNOWN
   * d566707b1aeb573b26e109e473851bcc4918daef Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12372)
 
   * 4365fc6b980d7d54643e1fcc7b7206342aa8b8e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more

2022-10-20 Thread GitBox


hudi-bot commented on PR #6227:
URL: https://github.com/apache/hudi/pull/6227#issuecomment-1286350693

   
   ## CI report:
   
   * cb79510068baa4769fe6496867c1e26754502f40 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10588)
 
   * 4a6c6a488d8ff2307833653d5f45df7ff0247ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12402)
 
   * 454cd1dc4e04aadfb8836553a45e201dc9abb8dc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6976: [HUDI-5042]fix clustering schedule problem in flink

2022-10-20 Thread GitBox


hudi-bot commented on PR #6976:
URL: https://github.com/apache/hudi/pull/6976#issuecomment-1286347957

   
   ## CI report:
   
   * 89e792414d09daaa8a367ecc4011450cc21e7069 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12386)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (5155a1716c -> 10f000285a)

2022-10-20 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 5155a1716c [MINOR] Update GitHub setting for branch protection (#7008)
 add 10f000285a [HUDI-4960] Upgrade jetty version for timeline server 
(#6844)

No new revisions were added by this update.

Summary of changes:
 hudi-timeline-service/pom.xml  |   2 +-
 .../hudi/timeline/service/RequestHandler.java  | 249 ++---
 .../hudi/timeline/service/TimelineService.java |  12 +-
 .../timeline/service/handlers/MarkerHandler.java   |   2 +-
 .../handlers/marker/MarkerCreationFuture.java  |   2 +-
 pom.xml|   2 +-
 6 files changed, 130 insertions(+), 139 deletions(-)



[GitHub] [hudi] yihua merged pull request #6844: [HUDI-4960] Upgrade jetty version for timeline server

2022-10-20 Thread GitBox


yihua merged PR #6844:
URL: https://github.com/apache/hudi/pull/6844


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1

2022-10-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4496:
--
Description: 
After running TestHoodieSparkSqlWriter test for different Spark versions, 
discovered that Orc version was incorrectly put as compile time dep on the 
classpath, breaking Orc writing in Hudi in Spark 3.1:

[https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true]

 

*--- UPDATE ---*

Unfortunately, it turned out that b/w Spark 2.4 and Spark 3.0, Spark ratcheted 
the Orc dependency from "nohive" classifier (which was dependent on its own 
cloned versions of the interfaces) onto a standard one (which depends on Hive's 
interfaces), and that makes compatibility w/ Orc for both Spark 2 and Spark >= 
3.x very complicated. 

After extensive deliberations and gauging the interest for Orc support in the Spark 
2.4.x branch of Hudi, we took the hard decision to drop Orc support in Hudi's 0.13 
release (for Spark 2.x) and instead fix it to work in the Spark 3.x module.

  was:
After running TestHoodieSparkSqlWriter test for different Spark versions, 
discovered that Orc version was incorrectly put as compile time dep on the 
classpath, breaking Orc writing in Hudi in Spark 3.1:

https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true


> ORC fails w/ Spark 3.1
> --
>
> Key: HUDI-4496
> URL: https://issues.apache.org/jira/browse/HUDI-4496
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> After running TestHoodieSparkSqlWriter test for different Spark versions, 
> discovered that Orc version was incorrectly put as compile time dep on the 
> classpath, breaking Orc writing in Hudi in Spark 3.1:
> [https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true]
>  
> *--- UPDATE ---*
> Unfortunately, it turned out that b/w Spark 2.4 and Spark 3.0, Spark 
> ratcheted the Orc dependency from "nohive" classifier (which was dependent on 
> its own cloned versions of the interfaces) onto a standard one (which depends 
> on Hive's interfaces), and that makes compatibility w/ Orc for both Spark 2 
> and Spark >= 3.x very complicated. 
> After extensive deliberations and gauging the interest for Orc support in 
> the Spark 2.4.x branch of Hudi, we took the hard decision to drop Orc support in 
> Hudi's 0.13 release (for Spark 2.x) and instead fix it to work in the Spark 
> 3.x module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (HUDI-4496) ORC fails w/ Spark 3.1

2022-10-20 Thread Alexey Kudinkin (Jira)


[ https://issues.apache.org/jira/browse/HUDI-4496 ]


Alexey Kudinkin deleted comment on HUDI-4496:
---

was (Author: alexey.kudinkin):
[https://github.com/apache/hudi/pull/6227]

> ORC fails w/ Spark 3.1
> --
>
> Key: HUDI-4496
> URL: https://issues.apache.org/jira/browse/HUDI-4496
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> After running TestHoodieSparkSqlWriter test for different Spark versions, 
> discovered that Orc version was incorrectly put as compile time dep on the 
> classpath, breaking Orc writing in Hudi in Spark 3.1:
> https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7012: [minor]add commit_action output in show_commits

2022-10-20 Thread GitBox


hudi-bot commented on PR #7012:
URL: https://github.com/apache/hudi/pull/7012#issuecomment-1286304713

   
   ## CI report:
   
   * 8ea6016beff64e73570d434931813bb8cee911d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12385)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7011: source operator(monitor and reader) support user uid

2022-10-20 Thread GitBox


hudi-bot commented on PR #7011:
URL: https://github.com/apache/hudi/pull/7011#issuecomment-1286304680

   
   ## CI report:
   
   * 4e780b9f2024918093bedfe9aaa9dd9687e5dd0e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12384)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 closed pull request #6507: [DO NOT MERGE] 0.12.0 release patch branch

2022-10-20 Thread GitBox


danny0405 closed pull request #6507: [DO NOT MERGE] 0.12.0 release patch branch 
URL: https://github.com/apache/hudi/pull/6507


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more

2022-10-20 Thread GitBox


hudi-bot commented on PR #6227:
URL: https://github.com/apache/hudi/pull/6227#issuecomment-1286264362

   
   ## CI report:
   
   * cb79510068baa4769fe6496867c1e26754502f40 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10588)
 
   * 4a6c6a488d8ff2307833653d5f45df7ff0247ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12402)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6986: [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true

2022-10-20 Thread GitBox


hudi-bot commented on PR #6986:
URL: https://github.com/apache/hudi/pull/6986#issuecomment-1286258240

   
   ## CI report:
   
   * 088cfee9a091af9f0327d4b432fcb3aa9a6e22ca Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12383)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] umehrot2 commented on issue #6900: [SUPPORT]Hudi Failed to read MARKERS file

2022-10-20 Thread GitBox


umehrot2 commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-1286257672

   Latest stack trace, when using structured streaming with the following 
properties:
   ```
   hoodie.datasource.hive_sync.enable=false
   hoodie.upsert.shuffle.parallelism=20
   hoodie.insert.shuffle.parallelism=20
   hoodie.keep.min.commits=6
   hoodie.keep.max.commits=7
   hoodie.parquet.small.file.limit=52428800
   hoodie.index.type=GLOBAL_BLOOM
   
hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.metadata.enable=true
   hoodie.cleaner.commits.retained=3
   hoodie.clean.max.commits=5
   hoodie.clean.async=false
   hoodie.clean.automatic=true
   hoodie.archive.async=false
   hoodie.datasource.compaction.async.enable=true
   hoodie.write.markers.type=DIRECT
   hoodie.embed.timeline.server=true
   hoodie.embed.timeline.server.async=false
   hoodie.compact.schedule.inline=false
   hoodie.compact.inline.max.delta.commits=2
   ```
   
   Stacktrace:
   ```
   22/10/19 15:36:18 ERROR UpsertPartitioner: Error trying to compute average 
bytes/record 
   org.apache.hudi.exception.HoodieIOException: Could not read commit details 
from 
s3://app-util/hudi-bloom/multi-stream-105/cdc_test_db/dhdata_15/.hoodie/20221019152438682.commit
   at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:761)
   at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:266)
   at 
org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.getInstantDetails(HoodieDefaultTimeline.java:372)
   at 
org.apache.hudi.table.action.commit.UpsertPartitioner.averageBytesPerRecord(UpsertPartitioner.java:373)
   at 
org.apache.hudi.table.action.commit.UpsertPartitioner.assignInserts(UpsertPartitioner.java:162)
   at 
org.apache.hudi.table.action.commit.UpsertPartitioner.(UpsertPartitioner.java:95)
   at 
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitPartitioner.(SparkUpsertDeltaCommitPartitioner.java:50)
   at 
org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.getUpsertPartitioner(BaseSparkDeltaCommitActionExecutor.java:69)
   at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getPartitioner(BaseSparkCommitActionExecutor.java:217)
   at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:163)
   at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
   at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:57)
   at 
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:46)
   at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:89)
   at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:76)
   at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157)
   at 
org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:213)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:304)
   at 
org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:91)
   at scala.util.Try$.apply(Try.scala:213)
   at 
org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:90)
   at 
org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:166)
   at 
org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:89)
   at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:600)
   at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   ```

[jira] [Created] (HUDI-5064) Improve docs around concurrency control and deployment models

2022-10-20 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-5064:
---

 Summary: Improve docs around concurrency control and deployment 
models
 Key: HUDI-5064
 URL: https://issues.apache.org/jira/browse/HUDI-5064
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286206453

   
   ## CI report:
   
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * 5cebd12dbcd4f87c35fb39c504141f0b1147a8b2 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12399)
 
   * c3c4376ccec55470031fd3c516722547cd553ee5 UNKNOWN
   * 03ffae125238e63048a248b3cb916b1397c882b5 UNKNOWN
   * 3ad93ee2ab40eb85811b7cd82f592451eb48b432 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12401)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #5370: [RFC-52][HUDI-3907] RFC for Introduce Secondary Index to Improve Hudi Query Performance

2022-10-20 Thread GitBox


yihua commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r1001154957


##
rfc/rfc-52/rfc-52.md:
##
@@ -0,0 +1,284 @@
+
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level 
Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+- [KV Mapping](#impl-index-layer-kv-mapping)
+- [Build Index](#impl-index-layer-build-index)
+- [Read Index](#impl-index-layer-read-index)
+- [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## Abstract
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate even after using column statistics in the metadata table, row group level or
+page level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, and saving IO has become
+the key to improving query performance.
+
+## Background
+Much work has been done to optimize reading HUDI table parquet files.
+
+Since Spark 3.2.0, with the power of the parquet column index, page level statistics can be used
+to filter data, and the process of reading data can be described as follows (Process A):
+- Step 1: Compare the middle position of each row group with the task split to decide which row
+   groups should be handled by the current task; if a row group's middle position falls within
+   the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row group level column statistics to pick out matching
+   row groups
+- Step 3: Filter pages by page level statistics for each column predicate, then get the matching
+   row id set for every column independently
+- Step 4: Get the final matching row id ranges by combining the matching rows of all columns,
+   then get the final matching pages for every column
+- Step 5: Load and decompress the matching pages for every requested column
+- Step 6: Read data by the matching row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## Insufficiency
+Although page level statistics can greatly reduce IO cost, some irrelevant data is still read.
+
+We need a way to fetch exactly the row data we need, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about, to
+speed up query performance.
+
+## Architecture
+The secondary index consists of the following layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/..., for managing secondary indexes
+2. Optimizer layer: picks the best physical/logical plan for a query using RBO/CBO/HBO, etc.
+3. Standard API interface layer: provides standard interfaces for upper layers to invoke, such as ``createIndex``,
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from,
+   such as HBase based, Lucene based, B+ tree based, etc.
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## Differences between Secondary Index and HUDI Record Level Index
+Before discussing the secondary index, let's take a look at the Record Level Index. While both indexes
+can filter useless data blocks, there are still many differences between them.
+
+At present, the record level index in Hudi
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, but not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate,
+it can only filter at the file group level, while the secondary index can provide the exact matching set of rows.
+
+For more details about the current implementation of the record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## Implementation
+
+### SQL Layer
+Parsing all kinds of index-related SQL (Spark/Flink, etc.), including create/drop/alter index
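
To make the "Standard API interface layer" above concrete, here is a hedged sketch in Java; only ``createIndex`` and ``getRowIdSet`` are named in the RFC, so every type and signature below is an assumption for illustration:

```java
import java.util.BitSet;

// Hypothetical shape of the RFC's "Standard API interface layer"; only the
// createIndex/getRowIdSet names come from the RFC, the rest is assumed.
public interface SecondaryIndexManager {

  // Build a secondary index of the given type (e.g. "lucene") on one column.
  void createIndex(String tableName, String columnName, String indexType);

  // Drop a previously created index.
  void dropIndex(String tableName, String columnName);

  // Return the ids of rows matching an equality predicate on the column.
  BitSet getRowIdSet(String tableName, String columnName, Object value);
}
```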

[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286201505

   
   ## CI report:
   
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * 5cebd12dbcd4f87c35fb39c504141f0b1147a8b2 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12399)
 
   * c3c4376ccec55470031fd3c516722547cd553ee5 UNKNOWN
   * 03ffae125238e63048a248b3cb916b1397c882b5 UNKNOWN
   * 3ad93ee2ab40eb85811b7cd82f592451eb48b432 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6871: Bump protobuf-java from 3.21.5 to 3.21.7

2022-10-20 Thread GitBox


hudi-bot commented on PR #6871:
URL: https://github.com/apache/hudi/pull/6871#issuecomment-1286196531

   
   ## CI report:
   
   * 2121007b7a400a1f9b0a0bef170667d5b68f539e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12380)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex

2022-10-20 Thread GitBox


hudi-bot commented on PR #6680:
URL: https://github.com/apache/hudi/pull/6680#issuecomment-1286196069

   
   ## CI report:
   
   * a65056adaa4e9fabda4205c1b5c7be2e48bdd67f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12382)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] pavimotorq opened a new issue, #7015: [SUPPORT] - BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :0 org.apache.hudi.exception.HoodieAppendException: Fai

2022-10-20 Thread GitBox


pavimotorq opened a new issue, #7015:
URL: https://github.com/apache/hudi/issues/7015

   
   **Describe the problem you faced**
   
   I'm trying to read a file containing around 200K records in JSON format, 
partition it based on a field "purpose", and store it in the local HDFS 
cluster. It always fails with the following error, but it works if I write 
to a non-HDFS location.
   
   
   
   Error logs:
   
   00:48  WARN: Timeline-server-based markers are not supported for HDFS: base path hdfs://localhost:9000/user/hive/warehouse/local_cow.  Falling back to direct markers.
   22/10/20 21:22:55 WARN DataStreamer: DataStreamer Exception
   java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
       at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1352)
       at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1420)
       at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1646)
       at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1547)
       at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
       at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
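
   The message suggests the DFS client cannot find a replacement datanode, which is expected on a single-datanode pseudo-distributed cluster since there is no other node to pick. A minimal sketch of a client-side mitigation for that case (the class name is just for illustration, and setting the policy to NEVER is an assumption for single-node setups, not a verified fix):

   ```
   import org.apache.hadoop.conf.Configuration;

   public final class SingleNodeHdfsClientConf {
     public static Configuration clientConf() {
       Configuration conf = new Configuration();
       // A single-datanode cluster never has a "good" replacement to try,
       // so tell the client not to attempt datanode replacement on failure.
       conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
       return conf;
     }
   }
   ```

   From PySpark, the same key can be passed through the session builder as `.config('spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy', 'NEVER')`.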
   
   
   **To Reproduce**
   
   
   import findspark
   findspark.init('/home/pavithran/DFSHudi/spark')

   from pyspark.sql import SparkSession

   spark = SparkSession.builder \
       .appName("DFSHudi") \
       .config('spark.jars.packages', 'org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0,org.apache.hadoop:hadoop-azure:3.3.4') \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.hudi.catalog.HoodieCatalog') \
       .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
       .getOrCreate()

   filtered_df = spark.read.option("multiLine", "true").json("hdfs://localhost:9000/data/dataset_batch1.json")
   print((filtered_df.count(), len(filtered_df.columns)))

   # Execute this before resuming session.
   basePath = "hdfs://localhost:9000/user/hive/warehouse/local_cow"
   tableName = "hudi_dfs_data"
   
   
   **Expected behavior**
   
   The data should be upserted and the table written under the HDFS base path, just as happens when writing to a non-HDFS location.
   
   **Environment Description**
   
   * Hudi version : 0.12
   
   * Spark version : 3.3.0
   
   * Hive version :
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   
   **Stacktrace**
   
   ```
   00:48  WARN: Timeline-server-based markers are not supported for HDFS: base path hdfs://localhost:9000/user/hive/warehouse/local_cow.  Falling back to direct markers.
   22/10/20 21:22:55 WARN DataStreamer: DataStreamer Exception
   java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
       at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1352)
       at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1420)
       at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1646)
       at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1547)
       at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
       at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
   22/10/20 21:22:55 WARN DFSClient: Error while syncing
   java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:9866,DS-9d9e9d11-c1d2-4ed0-bdf3-05709740ab9d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a cl

[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286137583

   
   ## CI report:
   
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * 5cebd12dbcd4f87c35fb39c504141f0b1147a8b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12399)
   * c3c4376ccec55470031fd3c516722547cd553ee5 UNKNOWN
   * 03ffae125238e63048a248b3cb916b1397c882b5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286131599

   
   ## CI report:
   
   * 7c19a482abfae743281ebbbd51842a034159295c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12393)
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * 5cebd12dbcd4f87c35fb39c504141f0b1147a8b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12399)
   * c3c4376ccec55470031fd3c516722547cd553ee5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7014: [HUDI-4971] Remove direct use of kryo from `SerDeUtils`

2022-10-20 Thread GitBox


hudi-bot commented on PR #7014:
URL: https://github.com/apache/hudi/pull/7014#issuecomment-1286124903

   
   ## CI report:
   
   * 1316c4a23620dc7ae4df8530a345548d14d595e8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12398)
   * 1a66c178d59a5fef0064589d4a44eba5b6eb7137 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12400)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286124731

   
   ## CI report:
   
   * 7c19a482abfae743281ebbbd51842a034159295c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12393)
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   * 5cebd12dbcd4f87c35fb39c504141f0b1147a8b2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7014: [HUDI-4971] Remove direct use of kryo from `SerDeUtils`

2022-10-20 Thread GitBox


hudi-bot commented on PR #7014:
URL: https://github.com/apache/hudi/pull/7014#issuecomment-1286117204

   
   ## CI report:
   
   * 96e2955878b9ae7780aff34645c98bf09974071a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12394)
   * 1316c4a23620dc7ae4df8530a345548d14d595e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12398)
   * 1a66c178d59a5fef0064589d4a44eba5b6eb7137 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7005: Utilities and spark ci gh actions

2022-10-20 Thread GitBox


hudi-bot commented on PR #7005:
URL: https://github.com/apache/hudi/pull/7005#issuecomment-1286117033

   
   ## CI report:
   
   * 8d571a216ccbeb61d6365785b215291fc0d1d899 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12357)
   * 7c19a482abfae743281ebbbd51842a034159295c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12393)
   * 56032a44e18867e8ae21f47383ee1273d5c9c806 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.

2022-10-20 Thread GitBox


hudi-bot commented on PR #6705:
URL: https://github.com/apache/hudi/pull/6705#issuecomment-1286107027

   
   ## CI report:
   
   * 666b6690b531d1540b196f0f97e175c7da38166d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12379)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] 02/02: Rebased `ByteBuffer` cloning onto the new utility

2022-10-20 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a commit to branch HUDI-4971-cancel-relocation
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a3017d2773c14e1400380e92d415a183518604a0
Author: Alexey Kudinkin 
AuthorDate: Thu Oct 20 13:01:13 2022 -0700

Rebased `ByteBuffer` cloning onto the new utility
---
 .../src/main/java/org/apache/hudi/common/util/AvroOrcUtils.java      | 4 ++--
 hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java  | 5 +++--
 .../src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala     | 6 ++----
 .../src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala        | 6 ++----
 4 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/AvroOrcUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/AvroOrcUtils.java
index ca59c301c8..c83ec68976 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/AvroOrcUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/AvroOrcUtils.java
@@ -54,6 +54,7 @@ import org.apache.hudi.exception.HoodieIOException;
 import org.apache.orc.TypeDescription;
 
 import static org.apache.avro.JsonProperties.NULL_VALUE;
+import static org.apache.hudi.common.util.BinaryUtils.toBytes;
 
 /**
  * Methods including addToVector, addUnionValue, createOrcSchema are originally from
@@ -221,8 +222,7 @@ public class AvroOrcUtils {
       binaryBytes = ((GenericData.Fixed)value).bytes();
     } else if (value instanceof ByteBuffer) {
       final ByteBuffer byteBuffer = (ByteBuffer) value;
-      binaryBytes = new byte[byteBuffer.remaining()];
-      byteBuffer.get(binaryBytes);
+      binaryBytes = toBytes(byteBuffer);
     } else if (value instanceof byte[]) {
       binaryBytes = (byte[]) value;
     } else {
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java
index 0cc4059197..4cb55f3790 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java
@@ -52,6 +52,8 @@ import java.util.Map;
 import java.util.Set;
 import java.util.stream.Collectors;
 
+import static org.apache.hudi.common.util.BinaryUtils.toBytes;
+
 /**
  * Utility functions for ORC files.
  */
@@ -238,8 +240,7 @@ public class OrcUtils extends BaseFileUtils {
     try (Reader reader = OrcFile.createReader(orcFilePath, OrcFile.readerOptions(conf))) {
       if (reader.hasMetadataValue("orc.avro.schema")) {
         ByteBuffer metadataValue = reader.getMetadataValue("orc.avro.schema");
-        byte[] bytes = new byte[metadataValue.remaining()];
-        metadataValue.get(bytes);
+        byte[] bytes = toBytes(metadataValue);
         return new Schema.Parser().parse(new String(bytes));
       } else {
         TypeDescription orcSchema = reader.getSchema();
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
index 58511f791e..dc413afff1 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
@@ -29,6 +29,7 @@ import org.apache.hudi.common.data.HoodieData
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.common.table.HoodieTableMetaClient
 import org.apache.hudi.common.table.view.FileSystemViewStorageConfig
+import org.apache.hudi.common.util.BinaryUtils.toBytes
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.common.util.collection
 import org.apache.hudi.common.util.hash.ColumnIndexID
@@ -469,10 +470,7 @@ object ColumnStatsIndexSupport {
         }
       case BinaryType =>
         value match {
-          case b: ByteBuffer =>
-            val bytes = new Array[Byte](b.remaining)
-            b.get(bytes)
-            bytes
+          case b: ByteBuffer => toBytes(b)
           case other => other
         }
 
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala
index 19d0a0a98b..294d282e3d 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala
@@ -33,10 +33,8 @@ object SerDeUtils {
   }
 
   def toBytes(o: Any): Array[Byte] = {
-    val bb: ByteBuffer = SERIALIZER_THREAD_LOCAL.get.serialize(o)
-    val bytes = new Array[Byte](bb.capacity())
-    bb.get(bytes)
-    bytes
+    val buf = SERIAL
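
For reference, a utility of the shape these call sites rely on might look like the following (a sketch inferred from the replaced inline code, not necessarily the exact `BinaryUtils.toBytes` added by the commit; the class name here is illustrative):

    import java.nio.ByteBuffer;

    public final class ByteBufferBytes {
      // Copies the remaining bytes of the buffer into a fresh array.
      // Reading through a duplicate leaves the caller's position untouched.
      public static byte[] toBytes(ByteBuffer buffer) {
        ByteBuffer dup = buffer.duplicate();
        byte[] bytes = new byte[dup.remaining()];
        dup.get(bytes);
        return bytes;
      }
    }

The inlined snippets being removed each read the buffer in place and advance its position; routing them through a single utility keeps that behavior consistent across call sites.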

[hudi] branch HUDI-4971-cancel-relocation created (now a3017d2773)

2022-10-20 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a change to branch HUDI-4971-cancel-relocation
in repository https://gitbox.apache.org/repos/asf/hudi.git


  at a3017d2773 Rebased `ByteBuffer` cloning onto the new utility

This branch includes the following new commits:

 new 82d78409a3 `BinaryUtil` > `BinaryUtils`; Added utility to extract bytes from `ByteBuffer`
 new a3017d2773 Rebased `ByteBuffer` cloning onto the new utility

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[hudi] 01/02: `BinaryUtil` > `BinaryUtils`; Added utility to extract bytes from `ByteBuffer`

2022-10-20 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a commit to branch HUDI-4971-cancel-relocation
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 82d78409a375e80de73002744be85229e1ecfc8a
Author: Alexey Kudinkin 
AuthorDate: Thu Oct 20 12:59:38 2022 -0700

`BinaryUtil` > `BinaryUtils`;
Added utility to extract bytes from `ByteBuffer`
---
 .../apache/hudi/sort/SpaceCurveSortingHelper.java  | 34 +++---
 .../spark/sql/hudi/execution/RangeSample.scala     | 10 +++
 .../hudi/common/table/HoodieTableConfig.java       |  4 +--
 .../util/{BinaryUtil.java => BinaryUtils.java}     | 12 +++-
 .../apache/hudi/common/util/SpillableMapUtils.java |  2 +-
 .../common/util/collection/BitCaskDiskMap.java     |  2 +-
 .../{TestBinaryUtil.java => TestBinaryUtils.java}  | 22 +++---
 7 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java
index 496168e844..1ff54773c4 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java
@@ -18,7 +18,7 @@
 
 package org.apache.hudi.sort;
 
-import org.apache.hudi.common.util.BinaryUtil;
+import org.apache.hudi.common.util.BinaryUtils;
 import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.config.HoodieClusteringConfig;
 import org.apache.hudi.optimize.HilbertCurveUtils;
@@ -158,7 +158,7 @@ public class SpaceCurveSortingHelper {
         .toArray(byte[][]::new);
 
       // Interleave received bytes to produce Z-curve ordinal
-      byte[] zOrdinalBytes = BinaryUtil.interleaving(zBytes, 8);
+      byte[] zOrdinalBytes = BinaryUtils.interleaving(zBytes, 8);
       return appendToRow(row, zOrdinalBytes);
     })
       .sortBy(f -> new ByteArraySorting((byte[]) f.get(fieldNum)), true, fileNum);
@@ -206,30 +206,30 @@ public class SpaceCurveSortingHelper {
   @Nonnull
   private static byte[] mapColumnValueTo8Bytes(Row row, int index, DataType dataType) {
     if (dataType instanceof LongType) {
-      return BinaryUtil.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getLong(index));
+      return BinaryUtils.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getLong(index));
     } else if (dataType instanceof DoubleType) {
-      return BinaryUtil.doubleTo8Byte(row.isNullAt(index) ? Double.MAX_VALUE : row.getDouble(index));
+      return BinaryUtils.doubleTo8Byte(row.isNullAt(index) ? Double.MAX_VALUE : row.getDouble(index));
     } else if (dataType instanceof IntegerType) {
-      return BinaryUtil.intTo8Byte(row.isNullAt(index) ? Integer.MAX_VALUE : row.getInt(index));
+      return BinaryUtils.intTo8Byte(row.isNullAt(index) ? Integer.MAX_VALUE : row.getInt(index));
     } else if (dataType instanceof FloatType) {
-      return BinaryUtil.doubleTo8Byte(row.isNullAt(index) ? Float.MAX_VALUE : row.getFloat(index));
+      return BinaryUtils.doubleTo8Byte(row.isNullAt(index) ? Float.MAX_VALUE : row.getFloat(index));
     } else if (dataType instanceof StringType) {
-      return BinaryUtil.utf8To8Byte(row.isNullAt(index) ? "" : row.getString(index));
+      return BinaryUtils.utf8To8Byte(row.isNullAt(index) ? "" : row.getString(index));
     } else if (dataType instanceof DateType) {
-      return BinaryUtil.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getDate(index).getTime());
+      return BinaryUtils.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getDate(index).getTime());
     } else if (dataType instanceof TimestampType) {
-      return BinaryUtil.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getTimestamp(index).getTime());
+      return BinaryUtils.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getTimestamp(index).getTime());
     } else if (dataType instanceof ByteType) {
-      return BinaryUtil.byteTo8Byte(row.isNullAt(index) ? Byte.MAX_VALUE : row.getByte(index));
+      return BinaryUtils.byteTo8Byte(row.isNullAt(index) ? Byte.MAX_VALUE : row.getByte(index));
     } else if (dataType instanceof ShortType) {
-      return BinaryUtil.intTo8Byte(row.isNullAt(index) ? Short.MAX_VALUE : row.getShort(index));
+      return BinaryUtils.intTo8Byte(row.isNullAt(index) ? Short.MAX_VALUE : row.getShort(index));
     } else if (dataType instanceof DecimalType) {
-      return BinaryUtil.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getDecimal(index).longValue());
+      return BinaryUtils.longTo8Byte(row.isNullAt(index) ? Long.MAX_VALUE : row.getDecimal(index).longValue());
     } else if (dataType instanceof BooleanType) {
       boolean value = row.isNullAt(index) ? false : row.getBoolean(index);
-      return BinaryUtil.intTo8Byte(value ? 1 : 0);
+      return BinaryUtils.intTo8Byte(value ? 1 : 0);
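
For context, the `interleaving` call in the first hunk builds a Z-order (Morton) ordinal by alternating bits of the per-column 8-byte keys, so rows that are close in the multi-column key space sort near each other. A simplified sketch of the idea (class and method names are illustrative; the real `BinaryUtils.interleaving` signature and bit layout may differ):

    public final class ZOrderSketch {
      // Bitwise interleaving of equal-length byte arrays (Morton / Z-order).
      public static byte[] interleave(byte[][] inputs, int bytesPerInput) {
        int n = inputs.length;
        byte[] out = new byte[n * bytesPerInput];
        int outBit = 0;
        for (int bit = 0; bit < bytesPerInput * 8; bit++) {      // walk bit positions MSB-first
          for (byte[] input : inputs) {
            int b = (input[bit / 8] >> (7 - (bit % 8))) & 1;     // take this input's bit
            out[outBit / 8] |= (byte) (b << (7 - (outBit % 8))); // append it to the ordinal
            outBit++;
          }
        }
        return out;
      }
    }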
