Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-11-07 Thread via GitHub


yihua merged PR #9888:
URL: https://github.com/apache/hudi/pull/9888


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-31 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1788299657

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 18f548a7df24f79a7abe8f1065db719562ba14c2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20570)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785820744

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * b25d25b4f3543bddfd4a138c1031d7f608e734ef Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20567)
   * 18f548a7df24f79a7abe8f1065db719562ba14c2 UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785664085

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * ad753318ae00d66dd7b05c3d0b021d32ae7a0808 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20563)
   * b25d25b4f3543bddfd4a138c1031d7f608e734ef Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20567)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785646016

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * ad753318ae00d66dd7b05c3d0b021d32ae7a0808 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20563)
   * b25d25b4f3543bddfd4a138c1031d7f608e734ef UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785500240

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * ad753318ae00d66dd7b05c3d0b021d32ae7a0808 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20563)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785263013

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20539)
   * ad753318ae00d66dd7b05c3d0b021d32ae7a0808 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20563)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-30 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1785241327

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20539)
   * ad753318ae00d66dd7b05c3d0b021d32ae7a0808 UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-28 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783735562

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20539)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-27 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783725481

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537)
   * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-27 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783724053

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-27 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783714779

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
   * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-27 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783713220

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
   * 00c91e2e52ef18c5880de00c450ad059090efc7d UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779765638

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779751539

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779680860

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779667511

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779563369

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779433402

   
   ## CI report:
   
   * 49758850653ddc671bb130d3a22558c4230c8ee0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20480)
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


linliu-code commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1371813541


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##
@@ -119,6 +120,35 @@ case class MergeOnReadIncrementalRelation(override val sqlContext: SQLContext,
 }
   }
 
+  def listFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Map[InternalRow, Seq[FileSlice]] = {

Review Comment:
   This function exists to extract the file splits from the relation; the file format will not use it.






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


linliu-code commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1371811100


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.model.HoodieFileGroupId
+import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types.{DataType, Decimal}
+import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
+
+import java.util
+
+case class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow,

Review Comment:
   Right.






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779334130

   
   ## CI report:
   
   * 49758850653ddc671bb130d3a22558c4230c8ee0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20480)
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


linliu-code commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1371808655


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.model.HoodieFileGroupId
+import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types.{DataType, Decimal}
+import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
+
+import java.util
+
+case class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow,
+  fileGroups: Map[HoodieFileGroupId, List[HoodieCDCFileSplit]]
+ ) extends InternalRow {
+
+  def getFileSplitsFor(fileGroupId: HoodieFileGroupId): Option[List[HoodieCDCFileSplit]] = {
+fileGroups.get(fileGroupId)
+  }
+
+  override def numFields: Int = {
+partitionValues.numFields
+  }
+
+  override def setNullAt(i: Int): Unit = {
+partitionValues.setNullAt(i)
+  }
+
+  override def update(i: Int, value: Any): Unit = {
+partitionValues.update(i, value)
+  }
+
+  override def copy(): InternalRow = {
+HoodiePartitionCDCFileGroupMapping(partitionValues.copy(), fileGroups)
+  }
+
+  override def isNullAt(ordinal: Int): Boolean = {
+partitionValues.isNullAt(ordinal)
+  }
+
+  override def getBoolean(ordinal: Int): Boolean = {
+partitionValues.getBoolean(ordinal)
+  }
+
+  override def getByte(ordinal: Int): Byte = {
+partitionValues.getByte(ordinal)
+  }
+
+  override def getShort(ordinal: Int): Short = {
+partitionValues.getShort(ordinal)
+  }
+
+  override def getInt(ordinal: Int): Int = {
+partitionValues.getInt(ordinal)
+  }
+
+  override def getLong(ordinal: Int): Long = {
+partitionValues.getLong(ordinal)
+  }
+
+  override def getFloat(ordinal: Int): Float = {
+partitionValues.getFloat(ordinal)
+  }
+
+  override def getDouble(ordinal: Int): Double = {
+partitionValues.getDouble(ordinal)
+  }
+
+  override def getDecimal(ordinal: Int, precision: Int, scale: Int): Decimal = {
+partitionValues.getDecimal(ordinal, precision, scale)
+  }
+
+  override def getUTF8String(ordinal: Int): UTF8String = {
+partitionValues.getUTF8String(ordinal)
+  }
+
+  override def getBinary(ordinal: Int): Array[Byte] = {
+partitionValues.getBinary(ordinal)
+  }
+
+  override def getInterval(ordinal: Int): CalendarInterval = {
+partitionValues.getInterval(ordinal)
+  }
+
+  override def getStruct(ordinal: Int, numFields: Int): InternalRow = {
+partitionValues.getStruct(ordinal, numFields)
+  }
+
+  override def getArray(ordinal: Int): ArrayData = {
+partitionValues.getArray(ordinal)
+  }
+
+  override def getMap(ordinal: Int): MapData = {
+partitionValues.getMap(ordinal)
+  }
+
+  override def get(ordinal: Int, dataType: DataType): AnyRef = {
+partitionValues.getMap(ordinal)

Review Comment:
   Right. Let me fix.
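
The `get(ordinal, dataType)` override in the hunk above forwards to `partitionValues.getMap(ordinal)`, which is presumably the issue acknowledged here. A minimal sketch of the delegation pattern with the generic accessor forwarded to its counterpart follows. `SimpleRow`, `ValuesRow`, and `DelegatingRow` are hypothetical stand-ins written for this illustration so that it runs without Spark; they are not Spark's `InternalRow` or the PR's class, and the `String` data type parameter is a simplification.

```scala
// Hypothetical stand-in for a row interface; only two accessors are modelled (assumption).
trait SimpleRow {
  def getMap(ordinal: Int): Map[String, String]
  def get(ordinal: Int, dataType: String): AnyRef
}

// Hypothetical backing row that stores plain values by position.
final class ValuesRow(values: IndexedSeq[AnyRef]) extends SimpleRow {
  // Casts the field to a map; fails with ClassCastException when the field is not a map.
  override def getMap(ordinal: Int): Map[String, String] =
    values(ordinal).asInstanceOf[Map[String, String]]
  override def get(ordinal: Int, dataType: String): AnyRef = values(ordinal)
}

// Wrapper in the spirit of the reviewed class: each accessor forwards to the wrapped row.
final class DelegatingRow(partitionValues: SimpleRow) extends SimpleRow {
  override def getMap(ordinal: Int): Map[String, String] =
    partitionValues.getMap(ordinal)
  // Forward the generic accessor to the generic accessor, not to getMap.
  override def get(ordinal: Int, dataType: String): AnyRef =
    partitionValues.get(ordinal, dataType)
}

object DelegationSketch {
  def main(args: Array[String]): Unit = {
    val row = new DelegatingRow(new ValuesRow(Vector[AnyRef]("2023-10-25", Int.box(7))))
    println(row.get(0, "string"))
    println(row.get(1, "int"))
  }
}
```

In this stand-in, routing `get` through `getMap` would throw for any field that is not a map, which is why each wrapper method should call the accessor of the same name.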






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


linliu-code commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1371808164


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -141,12 +145,37 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   case _ => baseFileReader(file)
 }
   }
+// CDC queries.
+case hoodiePartitionCDCFileGroupSliceMapping: HoodiePartitionCDCFileGroupMapping =>
+  val filePath: Path = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)
+  val fileGroupId: HoodieFileGroupId = new HoodieFileGroupId(filePath.getParent.toString, filePath.getName)
+  val fileSplits = hoodiePartitionCDCFileGroupSliceMapping.getFileSplitsFor(fileGroupId).get.toArray
+  val fileGroupSplit: HoodieCDCFileGroupSplit = HoodieCDCFileGroupSplit(fileSplits)
+  buildCDCRecordIterator(fileGroupSplit, preMergeBaseFileReader, hadoopConf, requiredSchema, props)
 // TODO: Use FileGroupReader here: HUDI-6942.
 case _ => baseFileReader(file)
   }
 }
   }
 
+  protected def buildCDCRecordIterator(cdcFileGroupSplit: HoodieCDCFileGroupSplit,
+   preMergeBaseFileReader: PartitionedFile => Iterator[InternalRow],
+   hadoopConf: Configuration,
+   requiredSchema: StructType,
+   props: TypedProperties): Iterator[InternalRow] = {
+val metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, tableState.tablePath, props)
+val cdcSchema = CDCRelation.FULL_CDC_SPARK_SCHEMA
+new CDCFileGroupIterator(

Review Comment:
   Right, it cannot right now because of its complex logic and our tight schedule. We should definitely do that later.
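
The CDC branch in the hunk above derives a file group ID from the file path (parent directory plus file name) and looks up the CDC file splits for it with `getFileSplitsFor(fileGroupId).get`. A rough, self-contained sketch of that lookup follows. `FileGroupId`, `CdcFileSplit`, and `PartitionCdcMapping` are hypothetical simplified types, not the Hudi classes, and replacing the bare `.get` with `getOrElse` plus a descriptive error is a variation for illustration, not what the PR does.

```scala
// Hypothetical simplified stand-ins for HoodieFileGroupId and HoodieCDCFileSplit.
final case class FileGroupId(partitionPath: String, fileId: String)
final case class CdcFileSplit(instant: String)

// Hypothetical mapping from file group to its CDC splits, mirroring getFileSplitsFor.
final case class PartitionCdcMapping(fileGroups: Map[FileGroupId, List[CdcFileSplit]]) {
  def getFileSplitsFor(id: FileGroupId): Option[List[CdcFileSplit]] = fileGroups.get(id)
}

object CdcLookupSketch {
  // Split a path into parent and file name, as the hunk does with
  // filePath.getParent and filePath.getName. Assumes the path contains a '/'.
  def fileGroupIdOf(path: String): FileGroupId = {
    val idx = path.lastIndexOf('/')
    FileGroupId(path.substring(0, idx), path.substring(idx + 1))
  }

  // Return the splits for the file, or fail with a message that names the missing key.
  def splitsFor(mapping: PartitionCdcMapping, path: String): Array[CdcFileSplit] = {
    val id = fileGroupIdOf(path)
    mapping.getFileSplitsFor(id)
      .getOrElse(throw new IllegalStateException(s"No CDC file splits registered for $id"))
      .toArray
  }

  def main(args: Array[String]): Unit = {
    val mapping = PartitionCdcMapping(
      Map(FileGroupId("/tbl/p=1", "fg-1") -> List(CdcFileSplit("001"))))
    println(splitsFor(mapping, "/tbl/p=1/fg-1").length)
  }
}
```

A bare `Option.get` raises `NoSuchElementException` without saying which file group was missing, so an explicit message can make a mismatch between the planned splits and the scanned file easier to diagnose.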






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


linliu-code commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1371807037


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCFileGroupIterator.scala:
##
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.cdc
+
+import org.apache.avro.Schema
+import org.apache.avro.generic.{GenericData, GenericRecord, IndexedRecord}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.BaseFileReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieDataSourceHelper.AvroDeserializerSupport
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.{AvroConversionUtils, AvroProjection, HoodieFileIndex, HoodieMergeOnReadFileSplit, HoodieTableSchema, HoodieTableState, LogFileIterator, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
+import org.apache.hudi.common.model.{FileSlice, HoodieAvroRecordMerger, HoodieLogFile, HoodieRecord, HoodieRecordMerger, HoodieRecordPayload}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.cdc.{HoodieCDCFileSplit, HoodieCDCUtils}
+import org.apache.hudi.common.table.cdc.HoodieCDCInferenceCase._
+import org.apache.hudi.common.table.log.HoodieCDCLogRecordIterator
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation._
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode._
+import org.apache.hudi.common.util.ValidationUtils.checkState
+import org.apache.hudi.config.HoodiePayloadConfig
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.avro.HoodieAvroDeserializer
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Projection
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+import java.io.Closeable
+import java.util.Properties
+import java.util.stream.Collectors
+import scala.annotation.tailrec
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaBufferConverter
+
+class CDCFileGroupIterator(split: HoodieCDCFileGroupSplit,

Review Comment:
   Yeah, that is actually what I want to discuss. This iterator contains logic that FileGroupReader cannot currently implement, and we need to migrate some of that logic into FileGroupReader later.






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779296702

   
   ## CI report:
   
   * 49758850653ddc671bb130d3a22558c4230c8ee0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20480)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779192289

   
   ## CI report:
   
   * 96eb0439735a32895d88c661b3a9da78eb3c8af5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20479)
   * 49758850653ddc671bb130d3a22558c4230c8ee0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20480)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779010705

   
   ## CI report:
   
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20460)
   * 96eb0439735a32895d88c661b3a9da78eb3c8af5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20479)
   * 49758850653ddc671bb130d3a22558c4230c8ee0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20480)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1778996402

   
   ## CI report:
   
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20460)
   * 96eb0439735a32895d88c661b3a9da78eb3c8af5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20479)
   * 49758850653ddc671bb130d3a22558c4230c8ee0 UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1778914675

   
   ## CI report:
   
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20460)
   * 96eb0439735a32895d88c661b3a9da78eb3c8af5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20479)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1778898618

   
   ## CI report:
   
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20460)
   * 96eb0439735a32895d88c661b3a9da78eb3c8af5 UNKNOWN
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1370761534


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.model.HoodieFileGroupId
+import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types.{DataType, Decimal}
+import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
+
+import java.util
+
+case class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow,
+  fileGroups: 
Map[HoodieFileGroupId, List[HoodieCDCFileSplit]]
+ ) extends InternalRow {
+
+  def getFileSplitsFor(fileGroupId: HoodieFileGroupId): 
Option[List[HoodieCDCFileSplit]] = {
+fileGroups.get(fileGroupId)
+  }
+
+  override def numFields: Int = {
+partitionValues.numFields
+  }
+
+  override def setNullAt(i: Int): Unit = {
+partitionValues.setNullAt(i)
+  }
+
+  override def update(i: Int, value: Any): Unit = {
+partitionValues.update(i, value)
+  }
+
+  override def copy(): InternalRow = {
+HoodiePartitionCDCFileGroupMapping(partitionValues.copy(), fileGroups)
+  }
+
+  override def isNullAt(ordinal: Int): Boolean = {
+partitionValues.isNullAt(ordinal)
+  }
+
+  override def getBoolean(ordinal: Int): Boolean = {
+partitionValues.getBoolean(ordinal)
+  }
+
+  override def getByte(ordinal: Int): Byte = {
+partitionValues.getByte(ordinal)
+  }
+
+  override def getShort(ordinal: Int): Short = {
+partitionValues.getShort(ordinal)
+  }
+
+  override def getInt(ordinal: Int): Int = {
+partitionValues.getInt(ordinal)
+  }
+
+  override def getLong(ordinal: Int): Long = {
+partitionValues.getLong(ordinal)
+  }
+
+  override def getFloat(ordinal: Int): Float = {
+partitionValues.getFloat(ordinal)
+  }
+
+  override def getDouble(ordinal: Int): Double = {
+partitionValues.getDouble(ordinal)
+  }
+
+  override def getDecimal(ordinal: Int, precision: Int, scale: Int): Decimal = 
{
+partitionValues.getDecimal(ordinal, precision, scale)
+  }
+
+  override def getUTF8String(ordinal: Int): UTF8String = {
+partitionValues.getUTF8String(ordinal)
+  }
+
+  override def getBinary(ordinal: Int): Array[Byte] = {
+partitionValues.getBinary(ordinal)
+  }
+
+  override def getInterval(ordinal: Int): CalendarInterval = {
+partitionValues.getInterval(ordinal)
+  }
+
+  override def getStruct(ordinal: Int, numFields: Int): InternalRow = {
+partitionValues.getStruct(ordinal, numFields)
+  }
+
+  override def getArray(ordinal: Int): ArrayData = {
+partitionValues.getArray(ordinal)
+  }
+
+  override def getMap(ordinal: Int): MapData = {
+partitionValues.getMap(ordinal)
+  }
+
+  override def get(ordinal: Int, dataType: DataType): AnyRef = {
+partitionValues.getMap(ordinal)

Review Comment:
   Should this be `partitionValues.get(ordinal, dataType)`?
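
   A minimal sketch of the delegation pattern in question, assuming simplified
   stand-in types (`SimpleRow`, `FieldType`, `ValuesRow`, `MappingRow` are all
   hypothetical and are not Spark or Hudi classes). It only illustrates that a
   wrapper's generic accessor should forward both arguments to the wrapped
   row's generic accessor rather than to a type-specific getter.

```scala
// Hypothetical stand-ins for illustration only; not Spark's InternalRow/DataType.
sealed trait FieldType
case object IntField extends FieldType
case object StringField extends FieldType

trait SimpleRow {
  def numFields: Int
  def getInt(ordinal: Int): Int
  def get(ordinal: Int, fieldType: FieldType): Any
}

// A plain row backed by a sequence of values.
final case class ValuesRow(values: Seq[Any]) extends SimpleRow {
  override def numFields: Int = values.length
  override def getInt(ordinal: Int): Int = values(ordinal).asInstanceOf[Int]
  override def get(ordinal: Int, fieldType: FieldType): Any = values(ordinal)
}

// Wrapper that carries extra metadata and forwards every accessor to the wrapped row.
final case class MappingRow(partitionValues: SimpleRow,
                            fileGroups: Map[String, List[String]]) extends SimpleRow {
  override def numFields: Int = partitionValues.numFields
  override def getInt(ordinal: Int): Int = partitionValues.getInt(ordinal)
  // The generic accessor forwards to the generic accessor, passing both arguments through.
  override def get(ordinal: Int, fieldType: FieldType): Any =
    partitionValues.get(ordinal, fieldType)
}

object DelegationDemo {
  // Example row with an Int field and a String field, plus illustrative metadata.
  val row: SimpleRow =
    MappingRow(ValuesRow(Seq(2023, "us-west")), Map("fg-1" -> List("split-a")))
}
```

   With this shape, `DelegationDemo.row.get(1, StringField)` reads the string
   field through the wrapper without going through any type-specific getter.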






Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1370759016


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -141,12 +145,37 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   case _ => baseFileReader(file)
 }
   }
+// CDC queries.
+case hoodiePartitionCDCFileGroupSliceMapping: 
HoodiePartitionCDCFileGroupMapping =>
+  val filePath: Path = 
sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)
+  val fileGroupId: HoodieFileGroupId = new 
HoodieFileGroupId(filePath.getParent.toString, filePath.getName)
+  val fileSplits = 
hoodiePartitionCDCFileGroupSliceMapping.getFileSplitsFor(fileGroupId).get.toArray
+  val fileGroupSplit: HoodieCDCFileGroupSplit = 
HoodieCDCFileGroupSplit(fileSplits)
+  buildCDCRecordIterator(fileGroupSplit, preMergeBaseFileReader, 
hadoopConf, requiredSchema, props)
 // TODO: Use FileGroupReader here: HUDI-6942.
 case _ => baseFileReader(file)
   }
 }
   }
 
+  protected def buildCDCRecordIterator(cdcFileGroupSplit: 
HoodieCDCFileGroupSplit,
+   preMergeBaseFileReader: PartitionedFile 
=> Iterator[InternalRow],
+   hadoopConf: Configuration,
+   requiredSchema: StructType,
+   props: TypedProperties): 
Iterator[InternalRow] = {
+val metaClient = 
HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, 
tableState.tablePath, props)
+val cdcSchema = CDCRelation.FULL_CDC_SPARK_SCHEMA
+new CDCFileGroupIterator(

Review Comment:
   This does not seem to leverage `HoodieFileGroupReader`.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##
@@ -119,6 +120,35 @@ case class MergeOnReadIncrementalRelation(override val 
sqlContext: SQLContext,
 }
   }
 
+  def listFileSplits(partitionFilters: Seq[Expression], dataFilters: 
Seq[Expression]): Map[InternalRow, Seq[FileSlice]] = {

Review Comment:
   Could this be extracted out as a util method instead of sitting inside the 
MOR incremental relation, which will not be used by the new Hudi parquet file 
format class?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCFileGroupIterator.scala:
##
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.cdc
+
+import org.apache.avro.Schema
+import org.apache.avro.generic.{GenericData, GenericRecord, IndexedRecord}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.BaseFileReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieDataSourceHelper.AvroDeserializerSupport
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.{AvroConversionUtils, AvroProjection, HoodieFileIndex, 
HoodieMergeOnReadFileSplit, HoodieTableSchema, HoodieTableState, 
LogFileIterator, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
+import org.apache.hudi.common.model.{FileSlice, HoodieAvroRecordMerger, 
HoodieLogFile, HoodieRecord, HoodieRecordMerger, HoodieRecordPayload}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.cdc.{HoodieCDCFileSplit, HoodieCDCUtils}
+import org.apache.hudi.common.table.cdc.HoodieCDCInferenceCase._
+import org.apache.hudi.common.table.log.HoodieCDCLogRecordIterator
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation._
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode._
+import org.apache.hudi.common.util.ValidationUtils.checkState
+import org.apache.hudi.config.HoodiePayloadConfig
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import 
org.apache.spark.sql.HoodieCatalystEx

Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1777811563

   
   ## CI report:
   
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20460)
   
   



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-136503

   
   ## CI report:
   
   * 512aabead021aed3817215d1c6aecf567cd0a575 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20401)
   * 0870d758b843dc8ae0d4f80d57297193dcced4a0 UNKNOWN
   
   