[hudi] branch hudi_test_suite_refactor updated (de6ec05 -> ff13b2a)

2020-07-06 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard de6ec05  [HUDI-394] Provide a basic implementation of test suite
 add ff13b2a  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (de6ec05)
            \
             N -- N -- N      refs/heads/hudi_test_suite_refactor (ff13b2a)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.
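The situation above can be pictured with a toy model (plain Scala, not real git): commits form a linked parent chain, and a branch ref is just a pointer that a force push moves to a new chain off the common base B.

```scala
// Toy model of a force-pushed branch: the old tip's commits become
// unreachable from the moved ref, while the common base stays reachable.
case class Commit(id: String, parent: Option[Commit])

def reachable(tip: Commit, id: String): Boolean =
  tip.id == id || tip.parent.exists(reachable(_, id))

val base   = Commit("B", None)             // common base B
val oldTip = Commit("de6ec05", Some(base)) // discarded O revision
val newTip = Commit("ff13b2a", Some(base)) // new N revision

// After the force push the branch ref points at newTip:
assert(!reachable(newTip, "de6ec05")) // old revision is gone from the branch
assert(reachable(newTip, "B"))        // common base B survives
```

This is only an illustration of reachability; real git additionally keeps unreachable objects alive while other refs (or the reflog) still point at them, which is what "omit" versus "discard" distinguishes.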

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/integ/testsuite/dag/nodes/HiveQueryNode.java | 5 +++--
 .../hudi/integ/testsuite/reader/DFSHoodieDatasetInputReader.java | 2 +-
 .../apache/hudi/integ/testsuite/writer/AvroFileDeltaInputWriter.java | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)



[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-06 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449947875



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
##
@@ -97,28 +98,9 @@ public Path makeNewPath(String partitionPath) {
    *
    * @param partitionPath Partition path
    */
-  protected void createMarkerFile(String partitionPath) {
-    Path markerPath = makeNewMarkerPath(partitionPath);
-    try {
-      LOG.info("Creating Marker Path=" + markerPath);
-      fs.create(markerPath, false).close();
-    } catch (IOException e) {
-      throw new HoodieException("Failed to create marker file " + markerPath, e);
-    }
-  }
-
-  /**
-   * THe marker path will be /.hoodie/.temp//2019/04/25/filename.
-   */
-  private Path makeNewMarkerPath(String partitionPath) {
-    Path markerRootPath = new Path(hoodieTable.getMetaClient().getMarkerFolderPath(instantTime));
-    Path path = FSUtils.getPartitionPath(markerRootPath, partitionPath);
-    try {
-      fs.mkdirs(path); // create a new partition as needed.
-    } catch (IOException e) {
-      throw new HoodieIOException("Failed to make dir " + path, e);
-    }
-    return new Path(path.toString(), FSUtils.makeMarkerFile(instantTime, writeToken, fileId));
+  protected void createMarkerFile(String partitionPath, String dataFileName) {
+    MarkerFiles markerFiles = new MarkerFiles(hoodieTable, instantTime);
+    markerFiles.createMarkerFile(partitionPath, dataFileName, getIOType());

Review comment:
   hi, i think createMarkerFile would be clearer
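For readers following along, the hunk above moves marker creation out of HoodieWriteHandle into a dedicated MarkerFiles helper. A toy Scala sketch of that delegation (the class and method names come from the diff; the path format and body are illustrative stubs, not Hudi's real implementation):

```scala
// Illustrative stub of the MarkerFiles helper introduced by the diff.
// Assumption: a marker path shaped like
//   <base>/.hoodie/.temp/<instant>/<partition>/<dataFile>.marker.<ioType>
class MarkerFiles(basePath: String, instantTime: String) {
  def createMarkerFile(partitionPath: String, dataFileName: String, ioType: String): String =
    s"$basePath/.hoodie/.temp/$instantTime/$partitionPath/$dataFileName.marker.$ioType"
}

val markers = new MarkerFiles("/tmp/tbl", "20200706")
val path = markers.createMarkerFile("2019/04/25", "f1.parquet", "CREATE")
assert(path == "/tmp/tbl/.hoodie/.temp/20200706/2019/04/25/f1.parquet.marker.CREATE")
```

The design point of the refactor: the write handle no longer touches the filesystem for markers itself; one helper owns the marker layout for all handle types.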





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash edited a comment on pull request #1704: [HUDI-115] Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-07-06 Thread GitBox


n3nash edited a comment on pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#issuecomment-654623082


   @bhasudha The PR looks good to me. It looks like the same ordering field will be honored in all places. One high-level question before I accept it: if `preCombine` & `combineAndGetUpdateValue` are using the same `orderingVal`, I'm guessing the user is expected to use the constructor with the `orderingVal`, and it is up to the user to ensure the `orderingVal` used in the constructor is the same as the one passed in `Map<..>`. If this is true, does `HoodieDeltaStreamer` allow for this kind of constructor invocation?
   Also, please rebase and push the PR; once the build succeeds we can merge it.
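A minimal sketch of the ordering semantics being discussed (a hypothetical payload class, not Hudi's actual OverwriteWithLatestAvroPayload; it only illustrates why the constructor's ordering value and the precombine field must agree):

```scala
// Hypothetical payload: both merge paths compare the same orderingVal,
// so a late-arriving record with a lower ordering value must not win.
case class Payload(value: String, orderingVal: Long) {
  // preCombine keeps the record with the higher ordering value
  def preCombine(other: Payload): Payload =
    if (other.orderingVal > this.orderingVal) other else this
  // combineAndGetUpdateValue applies the same rule against storage
  def combineAndGetUpdateValue(stored: Payload): Payload = preCombine(stored)
}

val stored   = Payload("old", orderingVal = 5L)
val incoming = Payload("new", orderingVal = 3L)
assert(incoming.combineAndGetUpdateValue(stored).value == "old")
assert(stored.preCombine(incoming).value == "old")
```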



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #1797: [HUDI-1069] Remove duplicate assertNoWriteErrors()

2020-07-06 Thread GitBox


xushiyan commented on pull request #1797:
URL: https://github.com/apache/hudi/pull/1797#issuecomment-654623835


   @yanghua this is ready for review. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #1704: [HUDI-115] Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-07-06 Thread GitBox


n3nash commented on pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#issuecomment-654623082


   @bhasudha The PR looks good to me. It looks like the same ordering field will be honored in all places. One high-level question before I accept it: if `preCombine` & `combineAndGetUpdateValue` are using the same `orderingVal`, I'm guessing the user is expected to use the constructor with the `orderingVal`, and it is up to the user to ensure the `orderingVal` used in the constructor is the same as the one passed in `Map<..>`. Can you put a comment around this, please?
   Also, please rebase and push the PR; once the build succeeds we can merge it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RajasekarSribalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

2020-07-06 Thread GitBox


RajasekarSribalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654588991


   @bhasudha Just an update: our jobs are not failing, but we get this error for the hard-delete operation; below is the command we use on the dataframe for the delete operation.
   
   My concern is whether Hudi is doing a rollback before the error. I hope it is not; please confirm.
   
   ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   18039 java.lang.IllegalArgumentException: Could not find any data file written for commit [20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
   18040 "partitionToRollingStats" : {
   18041 "" : {
   18042 "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
   18043 "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
   18044 "inserts" : 989333,
   
   deleteDataframe.write
     .format("hudi")
     .options(getQuickstartWriteConfigs)
     .option(OPERATION_OPT_KEY, "delete")
     .option(PRECOMBINE_FIELD_OPT_KEY, hudi_precombine_key)
     .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
     .option(PARTITIONPATH_FIELD_OPT_KEY, "")
     .option(TABLE_NAME, tablename)
     .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
     .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
     .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XX:1/;principal=/XXX")
     .option(HIVE_DATABASE_OPT_KEY, hudi_db)
     .option(HIVE_TABLE_OPT_KEY, tablename)
     .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
     .option(PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
     .mode(Append)
     .save("/user/X/hudi/" + tablename)
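The delete path in the snippet above works through the payload class: records written with an empty payload are treated as deletes on merge. A toy model of that semantic (plain Scala over a Map; it mimics the idea behind EmptyHoodieRecordPayload, not Hudi's actual merge code):

```scala
// Toy model of payload-based deletes: an incoming record whose payload
// resolves to None removes the stored record; Some(v) upserts it.
case class Record(key: String, payload: Option[String])

def merge(stored: Map[String, String], incoming: Seq[Record]): Map[String, String] =
  incoming.foldLeft(stored) {
    case (acc, Record(k, None))    => acc - k        // empty payload => delete
    case (acc, Record(k, Some(v))) => acc + (k -> v) // normal upsert
  }

val table = Map("u1" -> "a", "u2" -> "b")
assert(merge(table, Seq(Record("u1", None))) == Map("u2" -> "b"))
assert(merge(table, Seq(Record("u2", Some("c")))) == Map("u1" -> "a", "u2" -> "c"))
```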
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RajasekarSribalan edited a comment on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

2020-07-06 Thread GitBox


RajasekarSribalan edited a comment on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654588991


   @bhasudha Just an update: our jobs are not failing, but we get this error for the hard-delete operation; below is the command we use on the dataframe for the delete operation.
   
   My concern is whether Hudi is doing a rollback because of the error. I hope it is not; please confirm.
   
   ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   18039 java.lang.IllegalArgumentException: Could not find any data file written for commit [20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
   18040 "partitionToRollingStats" : {
   18041 "" : {
   18042 "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
   18043 "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
   18044 "inserts" : 989333,
   
   deleteDataframe.write
     .format("hudi")
     .options(getQuickstartWriteConfigs)
     .option(OPERATION_OPT_KEY, "delete")
     .option(PRECOMBINE_FIELD_OPT_KEY, hudi_precombine_key)
     .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
     .option(PARTITIONPATH_FIELD_OPT_KEY, "")
     .option(TABLE_NAME, tablename)
     .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
     .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
     .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XX:1/;principal=/XXX")
     .option(HIVE_DATABASE_OPT_KEY, hudi_db)
     .option(HIVE_TABLE_OPT_KEY, tablename)
     .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
     .option(PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
     .mode(Append)
     .save("/user/X/hudi/" + tablename)
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #331

2020-07-06 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[jira] [Commented] (HUDI-691) hoodie.*.consume.* should be set whitelist in hive-site.xml

2020-07-06 Thread GarudaGuo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152462#comment-17152462
 ] 

GarudaGuo commented on HUDI-691:


[~bhavanisudha] which doc should I append the issue to? Thanks.

> hoodie.*.consume.* should be set whitelist in hive-site.xml
> ---
>
> Key: HUDI-691
> URL: https://issues.apache.org/jira/browse/HUDI-691
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, newbie
>Reporter: Bhavani Sudha
>Assignee: GarudaGuo
>Priority: Minor
> Fix For: 0.6.0
>
>
> More details in this GH issue - 
> https://github.com/apache/incubator-hudi/issues/910



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

2020-07-06 Thread GitBox


zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query for testing
   ```
   //val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   val updatedDf = df.filter("_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```
   and we found it costs a lot of time resolving the relation of the parquet files, so we did the test as I mentioned.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

2020-07-06 Thread GitBox


zherenyu831 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query:
   ```
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   val updatedDf = df.filter("_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-06 Thread GitBox


garyli1019 commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r450584388



##
File path: 
hudi-spark/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieMergedParquetRowIterator.scala
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.mapred.JobConf
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.log.{HoodieMergedLogRecordScanner, LogReaderUtils}
+import org.apache.hudi.hadoop.config.HoodieRealtimeConfig
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit
+import org.apache.parquet.hadoop.ParquetRecordReader
+import org.apache.avro.Schema
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.types.StructType
+
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.model.HoodieRecordPayload
+
+import java.io.Closeable
+import java.util
+import scala.util.Try
+
+/**
+ * This class is the iterator for Hudi MOR table.
+ * Log files are scanned on initialization.
+ * This iterator will read the parquet file first and skip the record if it is present in the log file.
+ * Then read the log file.
+ * Custom payload is not supported yet. This combining logic is matching with [OverwriteWithLatestAvroPayload]
+ * @param rowReader ParquetRecordReader
+ */
+class HoodieMergedParquetRowIterator(private[this] var rowReader: ParquetRecordReader[UnsafeRow],

Review comment:
   
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/RecordReaderIterator.scala#L32





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-06 Thread GitBox


garyli1019 commented on a change in pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#discussion_r450584271



##
File path: 
hudi-spark/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetRealtimeFileFormat.scala
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.{FileSplit, JobConf}
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetFileReader, ParquetInputFormat, ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+import java.net.URI
+import scala.collection.JavaConverters._
+
+/**
+ * This class is an extension of ParquetFileFormat from Spark SQL.
+ * The file split, record reader, record reader iterator are customized to read Hudi MOR table.
+ */
+class HoodieParquetRealtimeFileFormat extends ParquetFileFormat {

Review comment:
   
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L295





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zuyanton commented on issue #1790: [SUPPORT] Querying MoR tables with DecimalType columns via Spark SQL fails

2020-07-06 Thread GitBox


zuyanton commented on issue #1790:
URL: https://github.com/apache/hudi/issues/1790#issuecomment-654566959


   @bhasudha Thank you for your reply. If I read the code correctly, I believe the handling of decimals is missing here: https://github.com/apache/hudi/blob/release-0.5.3/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L303
   and it looks like it was fixed in master with this PR: https://github.com/apache/hudi/commit/37838cea6094ddc66191df42e8b2c84f132d1623#diff-68b6e6f1a2c961fea254a2fc3b93ac23R209 ... let me check out master and verify.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashanthpdesai commented on issue #1775: INCREMETNAL QUERY-Null value Exception

2020-07-06 Thread GitBox


prashanthpdesai commented on issue #1775:
URL: https://github.com/apache/hudi/issues/1775#issuecomment-654559192


   @bhasudha : Hi, I tried with the same packages you mentioned above; we see a different kind of error.
   
   Please find the trace below .
   
   **spark-shell --queue queue_q1 --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'**
   Warning: Master yarn-client is deprecated since 2.0. Please use master 
"yarn" with specified deploy mode instead.
   The jars for the packages stored in: /home/edzmmprd/.ivy2/jars
   :: loading settings :: url = 
jar:file:/opt/mapr/spark/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
   org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
   org.apache.spark#spark-avro_2.11 added as a dependency
   :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found org.apache.hudi#hudi-spark-bundle_2.11;0.5.3 in central
   found org.apache.spark#spark-avro_2.11;2.4.4 in central
   found org.apache.spark#spark-tags_2.11;2.4.4 in central
   found org.spark-project.spark#unused;1.0.0 in central
   downloading 
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.3/hudi-spark-bundle_2.11-0.5.3.jar
 ...
   [SUCCESSFUL ] 
org.apache.hudi#hudi-spark-bundle_2.11;0.5.3!hudi-spark-bundle_2.11.jar (787ms)
   downloading 
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.4/spark-avro_2.11-2.4.4.jar
 ...
   [SUCCESSFUL ] 
org.apache.spark#spark-avro_2.11;2.4.4!spark-avro_2.11.jar (13ms)
   downloading 
https://repo1.maven.org/maven2/org/apache/spark/spark-tags_2.11/2.4.4/spark-tags_2.11-2.4.4.jar
 ...
   [SUCCESSFUL ] 
org.apache.spark#spark-tags_2.11;2.4.4!spark-tags_2.11.jar (6ms)
   downloading 
https://repo1.maven.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar
 ...
   [SUCCESSFUL ] 
org.spark-project.spark#unused;1.0.0!unused.jar (5ms)
   :: resolution report :: resolve 2006ms :: artifacts dl 817ms
   :: modules in use:
   org.apache.hudi#hudi-spark-bundle_2.11;0.5.3 from central in 
[default]
   org.apache.spark#spark-avro_2.11;2.4.4 from central in 
[default]
   org.apache.spark#spark-tags_2.11;2.4.4 from central in 
[default]
   org.spark-project.spark#unused;1.0.0 from central in 
[default]
   
-
   |  |modules||   
artifacts   |
   |   conf   | number| search|dwnlded|evicted|| 
number|dwnlded|
   
-
   |  default |   4   |   4   |   4   |   0   ||   4   
|   4   |
   
-
   :: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   4 artifacts copied, 0 already retrieved (20789kB/44ms)


   scala> import org.apache.hudi.QuickstartUtils._
   import org.apache.hudi.QuickstartUtils._

   scala> import scala.collection.JavaConversions._
   import scala.collection.JavaConversions._

   scala> import org.apache.spark.sql.SaveMode._
   import org.apache.spark.sql.SaveMode._

   scala> import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceReadOptions._

   scala> import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.DataSourceWriteOptions._

   scala> import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.config.HoodieWriteConfig._

   scala> val basepath= "/datalake/uhclake/edz/prd/mcm/mcm_hudi_cow_dedup_fix"
   basepath: String = /datalake/uhclake/edz/prd/mcm/mcm_hudi_cow_dedup_fix

   scala> spark.read.format("org.apache.hudi").load(basepath + 
"/*").createOrReplaceTempView("hudi_tab")
   
   scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as 
commitTime from  hudi_tab order by commitTime").map(k => 
k.getString(0)).take(50)
   commits: Array[String] = Array(20200703000922, 20200703002654, 
20200703010757, 20200703020715, 20200703030709, 20200703041422, 20200703051419, 
20200703060728, 20200703070921, 20200703080801, 20200703090728, 20200703101459, 
20200703110839, 20200703120708, 20200703131249, 20200703140738, 20200703151235, 
20200703160723, 20200703170659, 20200703181223, 20200703211557, 20200703220646, 
20200703231410, 20200704001432, 20200704010736, 20200704020754, 20200704030729, 
20200704040836, 20200704050652, 20200704060650, 20200704070749, 20200704080711, 
20200704090720, 20200

[jira] [Commented] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-07-06 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152414#comment-17152414
 ] 

sivabalan narayanan commented on HUDI-979:
--

[~309637554] : sure. go ahead :+1: 

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-07-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-979:


Assignee: liwei  (was: sivabalan narayanan)

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-07-06 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152411#comment-17152411
 ] 

liwei commented on HUDI-979:


[~xleesf] [~shivnarayan]  I can take this issue. We have a fix in our internal branch.

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] tooptoop4 commented on issue #1802: [SUPPORT] Delete gives Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

2020-07-06 Thread GitBox


tooptoop4 commented on issue #1802:
URL: https://github.com/apache/hudi/issues/1802#issuecomment-654539478


   The schema was int64 from the upsert but binary from the delete. Once the 
input data for the delete also used int64, it worked.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 closed issue #1802: [SUPPORT] Delete gives Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

2020-07-06 Thread GitBox


tooptoop4 closed issue #1802:
URL: https://github.com/apache/hudi/issues/1802


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 closed issue #1799: [SUPPORT] NPE at org.apache.hudi.table.HoodieCommitArchiveLog.lambda$getInstantsToArchive

2020-07-06 Thread GitBox


tooptoop4 closed issue #1799:
URL: https://github.com/apache/hudi/issues/1799


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 commented on issue #1801: [SUPPORT] org.apache.avro.AvroTypeException: Found com.uber.hoodie.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleane

2020-07-06 Thread GitBox


tooptoop4 commented on issue #1801:
URL: https://github.com/apache/hudi/issues/1801#issuecomment-654534604


   It was just a warning; closing.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 closed issue #1801: [SUPPORT] org.apache.avro.AvroTypeException: Found com.uber.hoodie.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan,

2020-07-06 Thread GitBox


tooptoop4 closed issue #1801:
URL: https://github.com/apache/hudi/issues/1801


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-07-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-472:
---

Assignee: Ethan Guo  (was: sivabalan narayanan)

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1014) Design and Implement upgrade-downgrade infrastructure

2020-07-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1014:


Assignee: sivabalan narayanan  (was: Balaji Varadarajan)

> Design and Implement upgrade-downgrade infrastructure
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-07-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-802:

Status: Patch Available  (was: In Progress)

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  
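The getInsertValue/combineAndGetUpdateValue split described above can be illustrated without Hudi's classes. This is a minimal model in plain Java - maps stand in for Avro records, and an empty Optional stands in for Hudi's "delete this record" signal - so it is only a sketch of the fix, not the real AWSDmsAvroPayload code:

```java
import java.util.Map;
import java.util.Optional;

// Minimal model of the DMS payload fix: an empty Optional means "drop the row".
// The key idea is that the "Op" column must be checked on the INSERT path too,
// not only inside combineAndGetUpdateValue.
public class DmsPayloadSketch {

    /** Returns empty if the DMS "Op" column marks the row as deleted. */
    static Optional<Map<String, Object>> handleDelete(Map<String, Object> record) {
        Object op = record.get("Op");
        if (op != null && "D".equals(op.toString())) {
            return Optional.empty(); // suppress the row entirely
        }
        return Optional.of(record);
    }

    /** Insert path: taken when the key is not yet in the table.
     *  The bug was that this path returned the record unconditionally. */
    static Optional<Map<String, Object>> getInsertValue(Map<String, Object> record) {
        return handleDelete(record); // the fix: inspect "Op" here as well
    }

    /** Update path: taken when the key already exists; this always checked "Op". */
    static Optional<Map<String, Object>> combineAndGetUpdateValue(
            Map<String, Object> current, Map<String, Object> incoming) {
        return handleDelete(incoming);
    }
}
```

With this shape, an insert immediately followed by a delete of the same key in one batch resolves to no row, because the delete record's "Op" flag is honored even when getInsertValue is the code path taken.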



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bhasudha commented on issue #1803: [SUPPORT] hoodie.datasource.write.precombine.field is ignored

2020-07-06 Thread GitBox


bhasudha commented on issue #1803:
URL: https://github.com/apache/hudi/issues/1803#issuecomment-654521251


   @joaqs190 quick questions:
   
   1. Could you describe what the precombine field is here? 
   2. Hudi has two ways of writing - the Spark datasource writer and 
DeltaStreamer. For DeltaStreamer, the config `--source-ordering-field` sets the 
precombine field. Can you confirm whether this is what you are configuring too? 
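For context, the precombine (source-ordering) field is what Hudi uses to pick one record per key when a batch contains duplicates: the record with the larger ordering value wins. The following is a rough sketch of that deduplication step in plain Java, not Hudi's actual implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of precombine semantics: among records sharing a key, keep the one
// with the greatest value in the ordering (precombine) field.
public class PrecombineSketch {

    static class Rec {
        final String key;
        final long orderingVal;
        final String payload;

        Rec(String key, long orderingVal, String payload) {
            this.key = key;
            this.orderingVal = orderingVal;
            this.payload = payload;
        }
    }

    static Map<String, Rec> precombine(List<Rec> batch) {
        Map<String, Rec> latest = new HashMap<>();
        for (Rec r : batch) {
            // keep the existing record on ties, otherwise the newer ordering value wins
            latest.merge(r.key, r, (existing, incoming) ->
                    incoming.orderingVal > existing.orderingVal ? incoming : existing);
        }
        return latest;
    }
}
```

If two records for the same key arrive with ordering values 1 and 5, only the one with value 5 survives the batch; this is the behavior the precombine configuration controls.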
 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1054:
-
Status: Patch Available  (was: In Progress)

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function that are becoming more prominent with the new bootstrap mechanism 
> on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>   is a serial operation performed at the driver, and it can take a long time 
> when you have many partitions and a large number of files.
>  * The invalid data paths are stored in a List instead of a Set, so the 
> following operation becomes O(N^2), taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have a large number of partitions and files.
>  
> Upon testing with a 1 TB data set having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with Spark parallelization and appropriate 
> data structures.
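The second bottleneck quoted above (List membership checks inside a loop) is the easiest to illustrate. This is a minimal sketch of that data-structure fix - switching the invalid-path collection to a HashSet so each contains() check is O(1) instead of O(N) - using hypothetical names, not Hudi's actual code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the HUDI-1054 data-structure fix: filtering N candidate paths
// against N invalid paths is O(N^2) with List.contains but O(N) with a HashSet.
public class InvalidPathFilter {

    static List<String> filterWithSet(List<String> allPaths, List<String> invalidPaths) {
        Set<String> invalid = new HashSet<>(invalidPaths); // O(1) membership lookups
        List<String> valid = new ArrayList<>();
        for (String p : allPaths) {
            if (!invalid.contains(p)) { // constant time instead of a linear scan
                valid.add(p);
            }
        }
        return valid;
    }
}
```

The same filtering with `invalidPaths.contains(p)` on a plain List would rescan the whole invalid list for every candidate path, which is exactly the quadratic blow-up the issue describes.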



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1021:
-
Status: Open  (was: New)

> [Bug] Unable to update bootstrapped table using rows from the written 
> bootstrapped table
> 
>
> Key: HUDI-1021
> URL: https://issues.apache.org/jira/browse/HUDI-1021
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Balaji Varadarajan
>Priority: Major
>
> Reproduction Steps:
>  
> {code:java}
> import spark.implicits._
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.HoodieDataSourceHelpers
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.spark.sql.SaveMode
> val sourcePath = 
> "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null"
> val sourceDf = spark.read.parquet(sourcePath + "/*")
> var tableName = "events_data_partitioned_non_null_00"
> var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName
> sourceDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Overwrite)
>  .save(tablePath)
> val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*")
> val updateDf = readDf.filter($"event_id" === "106")
>  .withColumn("event_name", lit("udit_event_106"))
>  
> updateDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Append)
>  .save(tablePath)
> {code}
>  
> Full Stack trace:
> {noformat}
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting 
> bucketType UPDATE for partition :0
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276)
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.e

[jira] [Updated] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1054:
-
Status: Open  (was: New)

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function that are becoming more prominent with the new bootstrap mechanism 
> on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>   is a serial operation performed at the driver, and it can take a long time 
> when you have many partitions and a large number of files.
>  * The invalid data paths are stored in a List instead of a Set, so the 
> following operation becomes O(N^2), taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have a large number of partitions and files.
>  
> Upon testing with a 1 TB data set having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with Spark parallelization and appropriate 
> data structures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-999) Parallelize listing of Source dataset partitions

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-999:

Status: In Progress  (was: Open)

> Parallelize listing of Source dataset partitions 
> -
>
> Key: HUDI-999
> URL: https://issues.apache.org/jira/browse/HUDI-999
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently, we are using a single thread in the driver to list all partitions 
> in the source dataset. This is a bottleneck when doing metadata bootstrap.
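The idea being tracked here - moving the listing off a single driver thread - can be sketched as a parallel stream over the partition paths. This is plain Java with a hypothetical `Lister` interface standing in for the filesystem call; Hudi itself would distribute the work with Spark, so this only illustrates the shape of the change:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of parallelized partition listing: each partition's file listing is an
// independent task, so the listings can run concurrently instead of serially.
public class ParallelListingSketch {

    interface Lister {
        List<String> listFiles(String partitionPath); // stands in for fs.listStatus(...)
    }

    static Map<String, List<String>> listAll(List<String> partitions, Lister lister) {
        return partitions.parallelStream() // fan the per-partition listings out
                .collect(Collectors.toConcurrentMap(p -> p, lister::listFiles));
    }
}
```

With thousands of partitions, a remote-store round trip per partition dominates, so running the listings concurrently (or, as in Hudi, as a Spark job across executors) cuts the wall-clock time roughly by the degree of parallelism.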



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1021:
-
Fix Version/s: 0.6.0

> [Bug] Unable to update bootstrapped table using rows from the written 
> bootstrapped table
> 
>
> Key: HUDI-1021
> URL: https://issues.apache.org/jira/browse/HUDI-1021
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Reproduction Steps:
>  
> {code:java}
> import spark.implicits._
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.HoodieDataSourceHelpers
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.spark.sql.SaveMode
> val sourcePath = 
> "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null"
> val sourceDf = spark.read.parquet(sourcePath + "/*")
> var tableName = "events_data_partitioned_non_null_00"
> var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName
> sourceDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Overwrite)
>  .save(tablePath)
> val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*")
> val updateDf = readDf.filter($"event_id" === "106")
>  .withColumn("event_name", lit("udit_event_106"))
>  
> updateDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Append)
>  .save(tablePath)
> {code}
>  
> Full Stack trace:
> {noformat}
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting 
> bucketType UPDATE for partition :0
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276)
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)

[jira] [Updated] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1021:
-
Priority: Blocker  (was: Major)

> [Bug] Unable to update bootstrapped table using rows from the written 
> bootstrapped table
> 
>
> Key: HUDI-1021
> URL: https://issues.apache.org/jira/browse/HUDI-1021
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Reproduction Steps:
>  
> {code:java}
> import spark.implicits._
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.HoodieDataSourceHelpers
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.spark.sql.SaveMode
> val sourcePath = 
> "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null"
> val sourceDf = spark.read.parquet(sourcePath + "/*")
> var tableName = "events_data_partitioned_non_null_00"
> var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName
> sourceDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Overwrite)
>  .save(tablePath)
> val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*")
> val updateDf = readDf.filter($"event_id" === "106")
>  .withColumn("event_name", lit("udit_event_106"))
>  
> updateDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Append)
>  .save(tablePath)
> {code}
>  
> Full Stack trace:
> {noformat}
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting 
> bucketType UPDATE for partition :0
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276)
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Tas

[jira] [Updated] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1054:
-
Status: In Progress  (was: Open)

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottleneck in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function, that are manifesting and becoming more prominent with the new 
> bootstrap mechanism on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>   is a serial operation performed at the driver and it can take a long time 
> when you have several partitions and large number of files.
>  * The invalid data paths are being stored in a List instead of Set and as a 
> result the following operation becomes N^2 taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have large number of partitions and files.
>  
> Upon testing with a 1 TB data set, having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with spark parallelization and using 
> appropriate data structures.
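The List-vs-Set point above can be sketched with a minimal, self-contained example (the class and method names are hypothetical, not Hudi's actual code): a membership test against a List is linear, so N lookups cost O(N^2) overall, while a HashSet makes each lookup constant time on average.

```java
import java.util.*;

public class InvalidPathCheck {

    // Counts written paths that appear in the invalid collection.
    // With a List, contains() scans linearly for each of the N written paths;
    // with a HashSet, each contains() is O(1) on average.
    static long countInvalid(Collection<String> invalid, List<String> written) {
        return written.stream().filter(invalid::contains).count();
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        for (int i = 0; i < 2000; i++) written.add("part-" + i + ".parquet");
        List<String> invalidList = new ArrayList<>(written.subList(0, 1000));
        Set<String> invalidSet = new HashSet<>(invalidList);
        System.out.println(countInvalid(invalidList, written)); // 1000, via N linear scans
        System.out.println(countInvalid(invalidSet, written));  // 1000, via N hash lookups
    }
}
```

The fix suggested in the ticket is exactly this data-structure swap: the result is identical, only the membership test changes from linear to constant time.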



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-991) Bootstrap Implementation Bugs

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-991.
-
Resolution: Duplicate

> Bootstrap Implementation Bugs
> -
>
> Key: HUDI-991
> URL: https://issues.apache.org/jira/browse/HUDI-991
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Priority: Major
>
> This story tracks all the bugs we encounter while testing bootstrap changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-992:
---

Assignee: Udit Mehrotra

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not have the partition column schema (in the case of hive-partitioned 
> data). As a result, during hive-sync when hudi tries to determine the type of 
> partition column from that schema, it would not find it and assume the 
> default data type *string*.
> Here is where partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus, no matter what the data type of the partition column is in the source data 
> (at least what spark infers it as from the path), it will always be synced as 
> string.
>  
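A minimal sketch of why this happens (hypothetical code, not Hudi's actual implementation): a hive-style partition value is recovered from the directory name, where it is plain text; if no schema records its type, there is nothing to say it was ever an int or a date, so string is the only safe default.

```java
public class PartitionTypeDemo {

    // Hive-style partition directories look like column=value. Parsed from the
    // path alone, both the key and the value are just text; without schema
    // information, a sync process can only default the column type to string.
    static String[] parsePartition(String path) {
        String dir = path.substring(path.lastIndexOf('/') + 1);
        int eq = dir.indexOf('=');
        return new String[]{dir.substring(0, eq), dir.substring(eq + 1)};
    }

    public static void main(String[] args) {
        String[] kv = parsePartition("s3://bucket/table/year=2020");
        System.out.println(kv[0] + " -> " + kv[1]); // year -> 2020 (text, not int)
    }
}
```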



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-991) Bootstrap Implementation Bugs

2020-07-06 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152373#comment-17152373
 ] 

Balaji Varadarajan commented on HUDI-991:
-

This is an umbrella ticket, which is not needed. Closing as a duplicate.

> Bootstrap Implementation Bugs
> -
>
> Key: HUDI-991
> URL: https://issues.apache.org/jira/browse/HUDI-991
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Priority: Major
>
> This story tracks all the bugs we encounter while testing bootstrap changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1001) Add implementation to translate source partition paths when doing metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1001:
-
Fix Version/s: (was: 0.6.0)
   0.6.1

> Add implementation to translate source partition paths when doing metadata 
> bootstrap
> 
>
> Key: HUDI-1001
> URL: https://issues.apache.org/jira/browse/HUDI-1001
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> While doing metadata bootstrap, we can provide the ability to change the 
> partition-path name. It will still be 1-1 between the source and bootstrapped 
> table, but we can make the partition-path adhere to hive style.
> For e:g /src_base_path/2020/06/05/ can be mapped to 
> /bootstrap_base_path/ds=2020%2F06%2F05/
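The mapping described above could look roughly like this (a hedged sketch; the method name and single-column assumption are illustrative, not the actual implementation): URL-encoding the source partition path collapses its slashes so it fits into one hive-style key=value directory.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PartitionPathTranslator {

    // Map a slash-delimited source partition path to a single hive-style
    // key=value directory name, URL-encoding the embedded slashes so the
    // whole source path survives as one directory component.
    static String toHiveStyle(String srcPartition, String column) {
        String encoded = URLEncoder.encode(srcPartition, StandardCharsets.UTF_8);
        return column + "=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(toHiveStyle("2020/06/05", "ds")); // ds=2020%2F06%2F05
    }
}
```

The mapping is reversible (URL-decode the value), which preserves the 1-1 correspondence between source and bootstrapped partitions mentioned in the ticket.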



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-971:

Status: Open  (was: New)

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].
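The exact cause lives in HBase's CellUtil, but the general shape of the problem can be illustrated (this is an assumption-laden sketch, not HBase or Hudi code): an HBase Cell exposes its row key as a slice (array, offset, length) of a shared backing buffer, and reading the whole backing array instead of copying just the slice yields an "unclean" name with extra bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CellKeyDemo {

    // Copy only the [offset, offset + length) slice of the backing buffer;
    // this is the clean row key. Using the whole array picks up neighboring
    // bytes from the shared buffer.
    static String cleanRow(byte[] backing, int offset, int length) {
        return new String(Arrays.copyOfRange(backing, offset, offset + length),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulated shared buffer: the partition name sits between other bytes.
        byte[] backing = "xx2020/06/05yy".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(backing, StandardCharsets.UTF_8)); // unclean: xx2020/06/05yy
        System.out.println(cleanRow(backing, 2, 10));                    // clean: 2020/06/05
    }
}
```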



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-991) Bootstrap Implementation Bugs

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-991:

Status: Open  (was: New)

> Bootstrap Implementation Bugs
> -
>
> Key: HUDI-991
> URL: https://issues.apache.org/jira/browse/HUDI-991
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Priority: Major
>
> This story tracks all the bugs we encounter while testing bootstrap changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-971:
---

Assignee: Balaji Varadarajan

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-619) Investigate and implement mechanism to have hive/presto/sparksql queries avoid stitching and return null values for hoodie columns

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-619:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Investigate and implement mechanism to have hive/presto/sparksql queries 
> avoid stitching and return null values for hoodie columns 
> ---
>
> Key: HUDI-619
> URL: https://issues.apache.org/jira/browse/HUDI-619
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration, Presto Integration, Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This idea is suggested by Vinoth during RFC review. This ticket is to track 
> the feasibility and implementation of it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-808) Support for cleaning source data

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-808:

Priority: Blocker  (was: Major)

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data of the original commit for Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-954) Test COW : Presto Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-954:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Test COW : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-954
> URL: https://issues.apache.org/jira/browse/HUDI-954
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-808) Support for cleaning source data

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-808:

Status: Open  (was: New)

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data of the original commit for Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-808) Support for cleaning source data

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-808:
---

Assignee: Wenning Ding  (was: Udit Mehrotra)

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Major
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data of the original commit for Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-956) Test COW : Presto Realtime Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-956:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Test COW : Presto Realtime Query with metadata bootstrap
> 
>
> Key: HUDI-956
> URL: https://issues.apache.org/jira/browse/HUDI-956
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-955) Test MOR : Presto Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-955:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Test MOR : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-955
> URL: https://issues.apache.org/jira/browse/HUDI-955
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-808) Support for cleaning source data

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-808:

Fix Version/s: 0.6.0

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.6.0
>
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data of the original commit for Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-621) Presto Integration for supporting Bootstrapped table

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-621:

Fix Version/s: 0.6.1

> Presto Integration for supporting Bootstrapped table
> 
>
> Key: HUDI-621
> URL: https://issues.apache.org/jira/browse/HUDI-621
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-828:

Fix Version/s: 0.6.0

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-828.
-
Resolution: Fixed

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-828.
-
Resolution: Duplicate

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reopened HUDI-828:
-

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-828:

Status: In Progress  (was: Open)

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-428) Web documentation for explaining how to bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-428:

Priority: Blocker  (was: Major)

> Web documentation for explaining how to bootstrap 
> --
>
> Key: HUDI-428
> URL: https://issues.apache.org/jira/browse/HUDI-428
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Need to provide examples (demo) to document bootstrapping



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-953) Test COW : Spark Data Source Read Optimized Queries

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-953:

Status: In Progress  (was: Open)

> Test COW : Spark Data Source Read Optimized Queries
> ---
>
> Key: HUDI-953
> URL: https://issues.apache.org/jira/browse/HUDI-953
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-828) Open Questions before merging Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152371#comment-17152371
 ] 

Balaji Varadarajan commented on HUDI-828:
-

This is part of the PR now and has been discussed.

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-953) Test COW : Spark Data Source Read Optimized Queries

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-953:

Fix Version/s: 0.6.0

> Test COW : Spark Data Source Read Optimized Queries
> ---
>
> Key: HUDI-953
> URL: https://issues.apache.org/jira/browse/HUDI-953
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-953) Test COW : Spark Data Source Read Optimized Queries

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-953:

Priority: Blocker  (was: Major)

> Test COW : Spark Data Source Read Optimized Queries
> ---
>
> Key: HUDI-953
> URL: https://issues.apache.org/jira/browse/HUDI-953
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-953) Test COW : Spark Data Source Read Optimized Queries

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-953.
-
Resolution: Fixed

> Test COW : Spark Data Source Read Optimized Queries
> ---
>
> Key: HUDI-953
> URL: https://issues.apache.org/jira/browse/HUDI-953
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-949) Test MOR : Hive Realtime Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-949:

Priority: Blocker  (was: Major)

> Test MOR : Hive Realtime Query with metadata bootstrap
> --
>
> Key: HUDI-949
> URL: https://issues.apache.org/jira/browse/HUDI-949
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-950) Test COW : Spark SQL Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-950:

Fix Version/s: 0.6.0

> Test COW : Spark SQL Read Optimized Query with metadata bootstrap
> -
>
> Key: HUDI-950
> URL: https://issues.apache.org/jira/browse/HUDI-950
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-950) Test COW : Spark SQL Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-950:

Priority: Blocker  (was: Major)

> Test COW : Spark SQL Read Optimized Query with metadata bootstrap
> -
>
> Key: HUDI-950
> URL: https://issues.apache.org/jira/browse/HUDI-950
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-952) Test MOR : Spark SQL Realtime Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-952:

Priority: Blocker  (was: Major)

> Test MOR : Spark SQL Realtime Query with metadata bootstrap
> ---
>
> Key: HUDI-952
> URL: https://issues.apache.org/jira/browse/HUDI-952
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-949) Test MOR : Hive Realtime Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-949:

Fix Version/s: 0.6.0

> Test MOR : Hive Realtime Query with metadata bootstrap
> --
>
> Key: HUDI-949
> URL: https://issues.apache.org/jira/browse/HUDI-949
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-951) Test MOR : Spark SQL Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-951:

Priority: Blocker  (was: Major)

> Test MOR : Spark SQL Read Optimized Query with metadata bootstrap
> -
>
> Key: HUDI-951
> URL: https://issues.apache.org/jira/browse/HUDI-951
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-951) Test MOR : Spark SQL Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-951:

Fix Version/s: 0.6.0

> Test MOR : Spark SQL Read Optimized Query with metadata bootstrap
> -
>
> Key: HUDI-951
> URL: https://issues.apache.org/jira/browse/HUDI-951
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-915:

Priority: Blocker  (was: Major)

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_ which is also the default behavior of spark when it writes the 
> data. With this partitioning, the partition column/schema is never stored in 
> the files but instead retrieved on the fly from the file paths which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the hudi 
> table folder. Also the *bootstrap schema* we are computing directly reads 
> schema from the source data file which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they are fully bootstrapped. During upsert time the 
> schema evolves because the upsert dataframe needs to have partition column in 
> it for performing upserts. Thus ultimately the *upserted rows* have the 
> correct partition column value stored, while the other records which are 
> simply copied over from the metadata bootstrap file have missing partition 
> column in them. Thus, we observe a different behavior here with 
> *bootstrapped* vs *non-bootstrapped* tables.
> While this is not at the moment creating issues with *Hive*, because it is 
> able to determine the partition columns from all the metadata it 
> stores, it creates a problem with other engines like *Spark* where 
> the partition columns will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good so that there are no 
> behavioral changes between bootstrapped vs non-bootstrapped tables.
>  ** In spark bootstrap relation and incremental query relation where we need 
> to figure out the latest schema, one can simply get the accurate schema from 
> the commit metadata file instead of having to determine whether or not 
> partition column is present in the schema obtained from the metadata file and 
> if not figure out the partition schema every time and merge (which can be 
> expensive).
>  * When doing upsert on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again this is consistent behavior with non-bootstrapped tables and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this it will be significantly more complicated to be able to 
> provide the partition value on the read side in spark, to be able to determine 
> every time whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in future, and the 
> bootstrap commit is say cleaned up and spark querying happens through 
> *parquet* datasource instead of *new bootstrapped datasource*, the *parquet 
> datasource* will return null values wherever it find the missing partition 
> values. In that case, we have no control over the *parquet* datasource as it 
> is simply reading from the file. 
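The first proposal above, merging the partition schema back into the file schema, can be sketched as follows (hypothetical names; real schemas here are Avro/Parquet types rather than plain column-name lists):

```java
import java.util.*;

public class SchemaMerge {

    // The schema read from a hive-style partitioned data file lacks the
    // partition column; appending the partition schema (when absent) restores
    // a complete schema that can be stored in the commit metadata.
    static List<String> mergeSchemas(List<String> fileCols, List<String> partitionCols) {
        List<String> full = new ArrayList<>(fileCols);
        for (String p : partitionCols) {
            if (!full.contains(p)) full.add(p);
        }
        return full;
    }

    public static void main(String[] args) {
        List<String> merged = mergeSchemas(List.of("id", "name"), List.of("year"));
        System.out.println(merged); // [id, name, year]
    }
}
```

Storing the merged result once, as the ticket proposes, avoids recomputing this merge on every read.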



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-915:

Fix Version/s: 0.6.0

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also the default behavior of Spark when it writes the 
> data. With this partitioning, the partition column/schema is never stored in 
> the files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we compute reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it; thus it is not complete.
> All this manifests as issues when we ultimately do *upserts* on these 
> bootstrapped files and they are fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus ultimately the *upserted rows* have 
> the correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, have the partition 
> column missing. Thus, we observe a different behavior here between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not at the moment creating issues with *Hive*, because it is 
> able to determine the partition columns from all the metadata it 
> stores, it creates a problem with other engines like *Spark*, where 
> the partition columns will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we need 
> to figure out the latest schema, one can simply get the accurate schema from 
> the commit metadata file instead of having to determine whether or not the 
> partition column is present in the schema obtained from the metadata file and, 
> if not, figure out the partition schema every time and merge (which can be 
> expensive).
>  * When doing upsert on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again this is consistent behavior with non-bootstrapped tables and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the *parquet 
> datasource* will return null values wherever it finds the missing partition 
> values. In that case, we have no control over the *parquet* datasource as it 
> is simply reading from the file. 
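The hive-style layout described in the ticket can be illustrated with a short sketch. This is not Hudi or Spark code; the class and method names are hypothetical. It shows why the partition value must be recovered from the `partition_key=partition_value` path segments rather than from the parquet file itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: recovers hive-style partition values (key=value directory
// segments) from a file path. Hudi/Spark do this internally; this helper
// is an illustration, not the actual API.
public class HivePartitionPathDemo {

  static Map<String, String> parsePartitionValues(String filePath) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String segment : filePath.split("/")) {
      int eq = segment.indexOf('=');
      // A directory segment like "datestr=2020-07-06" carries a partition value;
      // plain file names such as "part-00001.parquet" contain no '=' and are skipped.
      if (eq > 0) {
        values.put(segment.substring(0, eq), segment.substring(eq + 1));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    Map<String, String> v = parsePartitionValues(
        "warehouse/my_table/datestr=2020-07-06/part-00001.parquet");
    // The partition column never appears inside the parquet file itself;
    // it must be recovered from the path, as the ticket describes.
    System.out.println(v); // {datestr=2020-07-06}
  }
}
```

Storing the partition schema in the bootstrap commit metadata, as proposed, would avoid repeating this path-based recovery and merge on every read.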



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-952) Test MOR : Spark SQL Realtime Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-952:

Fix Version/s: 0.6.0

> Test MOR : Spark SQL Realtime Query with metadata bootstrap
> ---
>
> Key: HUDI-952
> URL: https://issues.apache.org/jira/browse/HUDI-952
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-948) Test MOR : Hive Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-948:

Priority: Blocker  (was: Major)

> Test MOR : Hive Read Optimized Query with metadata bootstrap
> 
>
> Key: HUDI-948
> URL: https://issues.apache.org/jira/browse/HUDI-948
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-947) Test COW : Hive Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-947:

Fix Version/s: 0.6.0

> Test COW : Hive Read Optimized Query with metadata bootstrap
> 
>
> Key: HUDI-947
> URL: https://issues.apache.org/jira/browse/HUDI-947
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Test Hive Queries as described in 
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-948) Test MOR : Hive Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-948:

Fix Version/s: 0.6.0

> Test MOR : Hive Read Optimized Query with metadata bootstrap
> 
>
> Key: HUDI-948
> URL: https://issues.apache.org/jira/browse/HUDI-948
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-947) Test COW : Hive Read Optimized Query with metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-947:

Priority: Blocker  (was: Major)

> Test COW : Hive Read Optimized Query with metadata bootstrap
> 
>
> Key: HUDI-947
> URL: https://issues.apache.org/jira/browse/HUDI-947
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Test Hive Queries as described in 
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-900) Metadata Bootstrap Key Generator needs to handle complex keys correctly

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-900:

Priority: Blocker  (was: Major)

> Metadata Bootstrap Key Generator needs to handle complex keys correctly
> ---
>
> Key: HUDI-900
> URL: https://issues.apache.org/jira/browse/HUDI-900
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>
> Look at ComplexKeyGenerator. Make sure MetadataBootstrap keys are of the same format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-946) Metadata Bootstrap Query Testing Master Ticket

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-946:

Priority: Blocker  (was: Major)

> Metadata Bootstrap Query Testing Master Ticket
> ---
>
> Key: HUDI-946
> URL: https://issues.apache.org/jira/browse/HUDI-946
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration, Presto Integration, Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
>  
> Query Pattern used for testing : 
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit?usp=sharing]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-899) Add a knob to change partition-path style while performing metadata bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-899:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Add a knob to change partition-path style while performing metadata bootstrap
> -
>
> Key: HUDI-899
> URL: https://issues.apache.org/jira/browse/HUDI-899
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-429) Long Running Testing to certify Bootstrapping

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-429.
-
Resolution: Fixed

> Long Running Testing to certify Bootstrapping
> -
>
> Key: HUDI-429
> URL: https://issues.apache.org/jira/browse/HUDI-429
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> It would be great if we run long running tests to perform bootstrapping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-429) Long Running Testing to certify Bootstrapping

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-429:

Priority: Blocker  (was: Major)

> Long Running Testing to certify Bootstrapping
> -
>
> Key: HUDI-429
> URL: https://issues.apache.org/jira/browse/HUDI-429
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> It would be great if we run long running tests to perform bootstrapping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-620) Hive Sync Integration of bootstrapped table

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-620:

Priority: Blocker  (was: Major)

> Hive Sync Integration of bootstrapped table
> ---
>
> Key: HUDI-620
> URL: https://issues.apache.org/jira/browse/HUDI-620
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Time Spent: 72h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-427) Implement CLI support for performing bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-427:

Priority: Blocker  (was: Major)

> Implement CLI support for performing bootstrap
> --
>
> Key: HUDI-427
> URL: https://issues.apache.org/jira/browse/HUDI-427
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: CLI
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> Need CLI to perform bootstrap as described in 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-418) Bootstrap Index - Implementation

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-418:

Priority: Blocker  (was: Major)

> Bootstrap Index - Implementation
> 
>
> Key: HUDI-418
> URL: https://issues.apache.org/jira/browse/HUDI-418
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> An implementation for 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
>  is present in 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-common/src/main/java/org/apache/hudi/common/consolidated/CompositeMapFile.java]
>  
> We need to make it solid with unit-tests and cleanup. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-420) Automated end to end Integration Test

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-420:

Priority: Blocker  (was: Major)

> Automated end to end Integration Test
> -
>
> Key: HUDI-420
> URL: https://issues.apache.org/jira/browse/HUDI-420
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>
> We need the end-to-end test in ITTestHoodieDemo to also include bootstrap 
> table cases.
> We can have a new table bootstrapped from the Hoodie table built in the demo 
> and ensure queries work and return the same responses



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-422) Cleanup bootstrap code and create write APIs for supporting bootstrap

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-422:

Priority: Blocker  (was: Major)

> Cleanup bootstrap code and create write APIs for supporting bootstrap 
> --
>
> Key: HUDI-422
> URL: https://issues.apache.org/jira/browse/HUDI-422
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 96h
>  Remaining Estimate: 0h
>
> Once the refactor for HoodieWriteClient is done, we can clean up and introduce 
> HoodieBootstrapClient as a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-424) Implement Hive Query Side Integration for querying tables containing bootstrap file slices

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-424:

Priority: Blocker  (was: Major)

> Implement Hive Query Side Integration for querying tables containing 
> bootstrap file slices
> --
>
> Key: HUDI-424
> URL: https://issues.apache.org/jira/browse/HUDI-424
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> Support for Hive read-optimized and realtime queries 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-426) Implement Spark DataSource Support for querying bootstrapped tables

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-426:

Priority: Blocker  (was: Major)

> Implement Spark DataSource Support for querying bootstrapped tables
> ---
>
> Key: HUDI-426
> URL: https://issues.apache.org/jira/browse/HUDI-426
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need the ability in the Spark DataSource to query a COW table which is 
> bootstrapped as per 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
>  
> The current implementation delegates to the Parquet DataSource, but this won't 
> work as we need the ability to stitch the columns externally.
>  
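The "stitching" mentioned in the ticket can be sketched as combining, row by row, the Hudi metadata columns kept in the skeleton file with the data columns left in the original source parquet file. This is purely illustrative; the class and method names are hypothetical and the real implementation operates on parquet row groups, not in-memory lists:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of external column stitching for a metadata-bootstrapped file:
// row i of the skeleton (metadata columns) is concatenated with row i of
// the source file (data columns). Row order must match in both files.
public class BootstrapStitchDemo {

  static List<List<String>> stitch(List<List<String>> metaRows,
                                   List<List<String>> dataRows) {
    if (metaRows.size() != dataRows.size()) {
      throw new IllegalArgumentException("row counts must match");
    }
    List<List<String>> out = new ArrayList<>();
    for (int i = 0; i < metaRows.size(); i++) {
      List<String> row = new ArrayList<>(metaRows.get(i)); // metadata columns first
      row.addAll(dataRows.get(i));                         // then data columns
      out.add(row);
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> stitched = stitch(
        List.of(List.of("key1", "commit1")),       // e.g. record key, commit time
        List.of(List.of("alice", "2020-07-06")));  // data columns
    System.out.println(stitched); // [[key1, commit1, alice, 2020-07-06]]
  }
}
```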



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-423) Implement upsert functionality for handling updates to these bootstrap file slices

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-423:

Priority: Blocker  (was: Major)

> Implement upsert functionality for handling updates to these bootstrap file 
> slices
> --
>
> Key: HUDI-423
> URL: https://issues.apache.org/jira/browse/HUDI-423
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> Needs support to handle upsert of these file-slices. For MOR tables, also 
> need compaction support. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-425) Implement support for bootstrapping in HoodieDeltaStreamer

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-425:

Priority: Blocker  (was: Major)

> Implement support for bootstrapping in HoodieDeltaStreamer
> --
>
> Key: HUDI-425
> URL: https://issues.apache.org/jira/browse/HUDI-425
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: help-wanted
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-421) Cleanup bootstrap code and create PR for FileStystemView changes

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-421:

Priority: Blocker  (was: Major)

> Cleanup bootstrap code and create PR for  FileStystemView changes
> -
>
> Key: HUDI-421
> URL: https://issues.apache.org/jira/browse/HUDI-421
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 240h
>  Remaining Estimate: 0h
>
> FileSystemView needs changes to identify and handle bootstrap file slices. 
> Code changes are present in 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap] and need cleanup before 
> they are ready to become a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-417) Refactor HoodieWriteClient so that commit logic can be shareable by both bootstrap and normal write operations

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-417:

Priority: Blocker  (was: Major)

> Refactor HoodieWriteClient so that commit logic can be shareable by both 
> bootstrap and normal write operations
> --
>
> Key: HUDI-417
> URL: https://issues.apache.org/jira/browse/HUDI-417
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> Basic Code Changes are present in the fork : 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]
>  
> The current implementation of HoodieBootstrapClient has duplicate code for 
> committing bootstrap. 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-client/src/main/java/org/apache/hudi/bootstrap/HoodieBootstrapClient.java]
>  
>  
> We can have an independent PR which would move this commit functionality 
> from HoodieWriteClient to a new base class, AbstractHoodieWriteClient, which 
> HoodieBootstrapClient can inherit.
>  
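The proposed refactor amounts to hoisting the shared commit path into an abstract base class. A minimal sketch, assuming the class names from the ticket but with illustrative bodies and a simplified `commit` signature that is not the real Hudi API:

```java
import java.util.List;

// Sketch: commit logic lives in an abstract base class so both the regular
// write client and the bootstrap client share it instead of duplicating it.
abstract class AbstractWriteClientSketch {
  // Shared commit path inherited by all subclasses.
  final boolean commit(String instantTime, List<String> writeStatuses) {
    System.out.println("committing " + writeStatuses.size()
        + " write statuses at instant " + instantTime);
    return true;
  }
}

// Regular writes and bootstrap writes both reuse the inherited commit().
class WriteClientSketch extends AbstractWriteClientSketch { }

class BootstrapClientSketch extends AbstractWriteClientSketch { }

public class RefactorDemo {
  public static void main(String[] args) {
    new BootstrapClientSketch().commit("20200706000000", List.of("s1", "s2"));
  }
}
```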



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-419) Basic Implementation for verifying if bootstrapping works end to end

2020-07-06 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-419:

Priority: Blocker  (was: Major)

> Basic Implementation for verifying if bootstrapping works end to end
> 
>
> Key: HUDI-419
> URL: https://issues.apache.org/jira/browse/HUDI-419
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core, Hive Integration, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As part of prototyping, I have most of the core functionalities in 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]
>  
> This includes:
>  # Timeline and FileSystem View changes
>  # New Bootstrap Client to perform Bootstrap
>  # DeltaStreamer Integration
>  # Hive Parquet Read Optimized reader integration
>  
> Needs to be done:
>  # Merge Handle changes to support upsert over a bootstrap file slice (read 
> part functionally similar to (4), write part same as the current Hoodie 
> MergeHandle).
>  # Unit Testing 
>  # Code cleanup as the current implementation has duplicated code.
>  # Automated integration test
>  # Hoodie CLI and Spark DataSource Write integration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1077) Integration tests to validate clustering

2020-07-06 Thread satish (Jira)
satish created HUDI-1077:


 Summary: Integration tests to validate clustering
 Key: HUDI-1077
 URL: https://issues.apache.org/jira/browse/HUDI-1077
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish


extend test-suite module to validate clustering



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1076) CLI tools to support clustering

2020-07-06 Thread satish (Jira)
satish created HUDI-1076:


 Summary: CLI tools to support clustering
 Key: HUDI-1076
 URL: https://issues.apache.org/jira/browse/HUDI-1076
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish


1) schedule clustering
2) complete clustering
3) cancel clustering
4) rollback clustering



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1075) Implement a simple merge clustering strategy

2020-07-06 Thread satish (Jira)
satish created HUDI-1075:


 Summary: Implement a simple merge clustering strategy 
 Key: HUDI-1075
 URL: https://issues.apache.org/jira/browse/HUDI-1075
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish


Provide action to merge N small parquet files into M parquet files (M < N). 
Avoid serializing and deserializing records and just copy parquet blocks when 
possible.
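The grouping step of merging N small files into M larger ones (M < N) can be sketched as greedy bin-packing of file sizes up to a target output size. This is an illustration of one possible policy, not the Hudi implementation; all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: greedily group small-file sizes so each group's total stays
// near a target output size; each group becomes one merged parquet file.
public class ClusteringGroupDemo {

  static List<List<Long>> planMergeGroups(List<Long> fileSizes, long targetSize) {
    List<List<Long>> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentTotal = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentTotal + size > targetSize) {
        groups.add(current);          // close the current output file group
        current = new ArrayList<>();
        currentTotal = 0;
      }
      current.add(size);
      currentTotal += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Six small files of size 40, target output size 128 -> 2 output files.
    List<List<Long>> plan = planMergeGroups(
        List.of(40L, 40L, 40L, 40L, 40L, 40L), 128L);
    System.out.println(plan.size() + " output files from 6 inputs");
  }
}
```

Copying parquet row groups between the grouped files, as the ticket suggests, would then avoid record-level serialization entirely.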



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1074) implement merge-sort based clustering strategy

2020-07-06 Thread satish (Jira)
satish created HUDI-1074:


 Summary: implement merge-sort based clustering strategy
 Key: HUDI-1074
 URL: https://issues.apache.org/jira/browse/HUDI-1074
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish


implement a merge-sort based clustering algorithm. Example: i) sort all small 
files by specified column(s); ii) merge N small files into M larger files while 
respecting the sort order (M < N)
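Step ii) above is a k-way merge: once each small file is sorted by the clustering column, a priority queue over the per-file cursors emits records in global order. A minimal sketch with integer keys standing in for records (illustrative only, not the Hudi implementation):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: k-way merge of already-sorted "files" (lists of keys), preserving
// sort order in the merged output, as the merge-sort clustering step requires.
public class SortMergeDemo {

  static List<Integer> kWayMerge(List<List<Integer>> sortedFiles) {
    // Heap entries: {next value, index of the source file}
    PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> a[0] - b[0]);
    List<Iterator<Integer>> iters = new ArrayList<>();
    for (int i = 0; i < sortedFiles.size(); i++) {
      Iterator<Integer> it = sortedFiles.get(i).iterator();
      iters.add(it);
      if (it.hasNext()) {
        heap.add(new int[] {it.next(), i});
      }
    }
    List<Integer> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      merged.add(top[0]);                     // emit the globally-smallest key
      Iterator<Integer> it = iters.get(top[1]);
      if (it.hasNext()) {
        heap.add(new int[] {it.next(), top[1]}); // refill from the same file
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Integer> out = kWayMerge(
        List.of(List.of(1, 4, 7), List.of(2, 5), List.of(3, 6)));
    System.out.println(out); // [1, 2, 3, 4, 5, 6, 7]
  }
}
```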



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1073) Implement skeleton to support multiple clustering strategies

2020-07-06 Thread satish (Jira)
satish created HUDI-1073:


 Summary: Implement skeleton to support multiple clustering 
strategies
 Key: HUDI-1073
 URL: https://issues.apache.org/jira/browse/HUDI-1073
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish


Implement skeleton to support
* scheduling clustering
* completing the clustering action

Support should include a high-level API to sort by specific columns
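A pluggable skeleton along these lines could separate the schedule and complete phases behind one strategy interface. All names here are hypothetical, not the eventual Hudi API; sorting file names stands in for the column-sort API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a pluggable clustering-strategy skeleton: scheduling picks
// candidate file groups, execution rewrites each group into new files.
public class ClusteringSkeletonDemo {

  interface ClusteringStrategy {
    // "Schedule" phase: decide which file groups to cluster.
    List<List<String>> schedule(List<String> candidateFiles);

    // "Complete" phase: rewrite one scheduled group into new files.
    List<String> execute(List<String> fileGroup);
  }

  // Trivial strategy: cluster everything into one group and sort by name,
  // standing in for the high-level sort-by-columns API.
  static class SortByNameStrategy implements ClusteringStrategy {
    @Override
    public List<List<String>> schedule(List<String> candidateFiles) {
      return List.of(candidateFiles);
    }

    @Override
    public List<String> execute(List<String> fileGroup) {
      return fileGroup.stream().sorted().collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    ClusteringStrategy strategy = new SortByNameStrategy();
    List<List<String>> plan = strategy.schedule(List.of("f3", "f1", "f2"));
    System.out.println(strategy.execute(plan.get(0))); // [f1, f2, f3]
  }
}
```

New strategies (e.g. the simple-merge and merge-sort variants in the sibling tickets) would then only implement this interface.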



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   >