[GitHub] [hudi] PhatakN1 commented on issue #1549: Potential issue when using Deltastreamer with DMS

2020-07-05 Thread GitBox


PhatakN1 commented on issue #1549:
URL: https://github.com/apache/hudi/issues/1549#issuecomment-654000154


   No – my issue was specifically related to the MoR format when I use the 
AwsDMSPayload class.
   
   From: Sivabalan Narayanan 
   Sent: Sunday, July 5, 2020 1:56 AM
   To: apache/hudi 
   Cc: Phatak, Ninad Vidyadhar ; Mention 

   Subject: Re: [apache/hudi] Potential issue when using Deltastreamer with DMS 
(#1549)
   
   
   @PhatakN1: was 
this something you were also facing?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #330

2020-07-05 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.34 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[GitHub] [hudi] garyli1019 commented on issue #1786: [SUPPORT] Bulk insert slow on MOR

2020-07-05 Thread GitBox


garyli1019 commented on issue #1786:
URL: https://github.com/apache/hudi/issues/1786#issuecomment-653989028


   Hi @rvd8345, by `shuffle parallelism` do you mean 
`spark.shuffle.partition` or the Hudi parallelism?
   For bulk insert, the Hudi parallelism seems too large for 9.7 GB of data. With 
this config, it will create a lot of small files.
   Also, a screenshot of the stage 5 details would be helpful as well.
   Would you try tuning the following config:
   set `hoodie.bulkinsert.shuffle.parallelism` to `100` and leave the file size 
limit at the default?
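
   As an illustration of that suggestion, a bulk_insert write with the lowered 
parallelism could look roughly like the sketch below. It assumes an existing 
SparkSession named `spark`; the input path, table name, and record/partition key 
fields are placeholder assumptions, not values taken from this issue.

   // Hypothetical bulk_insert write that caps the shuffle parallelism at 100.
   // Paths, table name, and key/partition fields are made-up placeholders.
   import org.apache.spark.sql.SaveMode

   val inputDf = spark.read.parquet("/tmp/source_data")   // e.g. the ~9.7 GB input

   inputDf.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "dt")
     // the knob discussed above: fewer shuffle partitions -> fewer, larger files
     .option("hoodie.bulkinsert.shuffle.parallelism", "100")
     .mode(SaveMode.Overwrite)
     .save("/tmp/hudi/my_table")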



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1702: Bootstrap datasource changes

2020-07-05 Thread GitBox


garyli1019 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r449934912



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -54,29 +58,54 @@ class DefaultSource extends RelationProvider
 val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ 
translateViewTypesToQueryTypes(optParams)
 
 val path = parameters.get("path")
-if (path.isEmpty) {
-  throw new HoodieException("'path' must be specified.")
-}
 
 if (parameters(QUERY_TYPE_OPT_KEY).equals(QUERY_TYPE_SNAPSHOT_OPT_VAL)) {
-  // this is just effectively RO view only, where `path` can contain a mix 
of
-  // non-hoodie/hoodie path files. set the path filter up
-  sqlContext.sparkContext.hadoopConfiguration.setClass(
-"mapreduce.input.pathFilter.class",
-classOf[HoodieROTablePathFilter],
-classOf[org.apache.hadoop.fs.PathFilter])
-
-  log.info("Constructing hoodie (as parquet) data source with options :" + 
parameters)
-  log.warn("Snapshot view not supported yet via data source, for 
MERGE_ON_READ tables. " +
-"Please query the Hive table registered using Spark SQL.")
-  // simply return as a regular parquet relation
-  DataSource.apply(
-sparkSession = sqlContext.sparkSession,
-userSpecifiedSchema = Option(schema),
-className = "parquet",
-options = parameters)
-.resolveRelation()
+  val readPathsStr = 
parameters.get(DataSourceReadOptions.READ_PATHS_OPT_KEY)

Review comment:
   Are these additional paths on top of the `path`? Any examples of the use 
cases?
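
   For context on this review thread, a minimal snapshot read through this data 
source might look like the sketch below. It only uses the `QUERY_TYPE` option 
constants visible in the DefaultSource diff above; the SparkSession setup and 
the base path (including the partition glob) are placeholder assumptions.

   // Hypothetical snapshot query through the Hudi Spark data source.
   // The base path and partition glob are placeholders.
   import org.apache.hudi.DataSourceReadOptions
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("hudi-snapshot-read").getOrCreate()

   val df = spark.read
     .format("org.apache.hudi")
     // same constants the DefaultSource code above checks for
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
       DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
     .load("/data/hudi/my_table/*/*")

   df.printSchema()
   df.show(10)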

##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiBootstrapRDD.scala
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+class HudiBootstrapRDD(@transient spark: SparkSession,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   skeletonReadFunction: PartitionedFile => Iterator[Any],
+   regularReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   skeletonSchema: StructType,
+   requiredColumns: Array[String],
+   tableState: HudiBootstrapTableState)
+  extends RDD[InternalRow](spark.sparkContext, Nil) {
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val bootstrapPartition = split.asInstanceOf[HudiBootstrapPartition]
+
+if (log.isDebugEnabled) {
+  if (bootstrapPartition.split.skeletonFile.isDefined) {
+logDebug("Got Split => Index: " + bootstrapPartition.index + ", Data 
File: "
+  + bootstrapPartition.split.dataFile.filePath + ", Skeleton File: "
+  + bootstrapPartition.split.skeletonFile.get.filePath)
+  } else {
+logDebug("Got Split => Index: " + bootstrapPartition.index + ", Data 
File: "
+  + bootstrapPartition.split.dataFile.filePath)
+  }
+}
+
+var partitionedFileIterator: Iterator[InternalRow] = null
+
+if (bootstrapPartition.split.skeletonFile.isDefined) {
+  // It is a bootstrap split. Check both skeleton and data files.
+  if (dataSchema.isEmpty) {
+// No data column to fetch, hence fetch only from skeleton file
+partitionedFileIterator = 
read(bootstrapPartition.split.skeletonFile.get,  skeletonReadFunction)
+  } else if (skeletonSchema.isEmpty) {
+// No metadata column to fetch, hence fetch only from data file
+partitionedFileIterator = read(bootstrapPartition.split.dataFile, 
dataReadFunction)
+  } else {
+// Fetch from both data and skeleton file, and merge
+val dataFileIterator = read(bootstrapPartition.split.data

[GitHub] [hudi] lw309637554 commented on pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#issuecomment-653975959


   > Thanks for adding the tests.. left some comments..
   > 
   > I am trying to cross check the deletion of the marker dir once more (the 
issue you mentioned before in prev iteration).. I will push an update to the 
branch.. and it should be good to go.
   
   Thanks, I will fix the issue as per the comment. A more comprehensive solution 
for deleting the marker files will depend on you.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449948302



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/rollback/TestMergeOnReadRollbackActionExecutor.java
##
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.rollback;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieRollbackStat;
+
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+import org.apache.hudi.testutils.HoodieTestDataGenerator;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static 
org.apache.hudi.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH;
+import static 
org.apache.hudi.testutils.HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+
+public class TestMergeOnReadRollbackActionExecutor extends 
HoodieClientTestBase {
+  @Override
+  protected HoodieTableType getTableType() {
+return HoodieTableType.MERGE_ON_READ;
+  }
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initPath();
+initSparkContexts();
+// just generate two partitions
+dataGen = new HoodieTestDataGenerator(new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH});
+initFileSystem();
+initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+cleanupResources();
+  }
+
+  private void twoUpsertCommitDataRollBack(boolean isUsingMarkers) throws 
IOException, InterruptedException {

Review comment:
   Yes, that makes sense.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449948082



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java
##
@@ -445,10 +442,20 @@ public void 
testCOWToMORConvertedTableRollback(HoodieFileFormat baseFileFormat)
 
   @ParameterizedTest
   @MethodSource("argumentsProvider")
-  public void testRollbackWithDeltaAndCompactionCommit(HoodieFileFormat 
baseFileFormat) throws Exception {
+  public void testCOWToMORConvertedTableRollbackUsingFileList(HoodieFileFormat 
baseFileFormat) throws Exception {

Review comment:
   ok





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449947972



##
File path: hudi-client/src/test/java/org/apache/hudi/table/TestCleaner.java
##
@@ -904,6 +901,19 @@ public void testCleanMarkerDataFilesOnRollback() throws 
IOException {
 assertEquals(0, getTotalTempFiles(), "All temp files are deleted.");
   }
 
+  /**
+   * Test Cleaning functionality of table.rollback() API.
+   */
+  @Test
+  public void testCleanMarkerDataFilesOnRollbackUsingFileList() throws 
IOException {

Review comment:
   ok





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449947922



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -54,29 +53,28 @@
 /**
  * Performs Rollback of Hoodie Tables.
  */
-public class RollbackHelper implements Serializable {
+public class ListingBasedRollbackHelper implements Serializable {
 
-  private static final Logger LOG = LogManager.getLogger(RollbackHelper.class);
+  private static final Logger LOG = 
LogManager.getLogger(ListingBasedRollbackHelper.class);
 
   private final HoodieTableMetaClient metaClient;
   private final HoodieWriteConfig config;
 
-  public RollbackHelper(HoodieTableMetaClient metaClient, HoodieWriteConfig 
config) {
+  public ListingBasedRollbackHelper(HoodieTableMetaClient metaClient, 
HoodieWriteConfig config) {
 this.metaClient = metaClient;
 this.config = config;
   }
 
   /**
* Performs all rollback actions that we have collected in parallel.
*/
-  public List performRollback(JavaSparkContext jsc, 
HoodieInstant instantToRollback, List rollbackRequests) {
+  public List performRollback(JavaSparkContext jsc, 
HoodieInstant instantToRollback, List 
rollbackRequests) {
 
-String basefileExtension = 
metaClient.getTableConfig().getBaseFileFormat().getFileExtension();
 SerializablePathFilter filter = (path) -> {
-  if (path.toString().contains(basefileExtension)) {
+  if 
(path.toString().endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {

Review comment:
   Getting the format from the table object would be better.

##
File path: hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
##
@@ -328,6 +330,18 @@ public void 
testSimpleTagLocationAndUpdateWithRollback(IndexType indexType) thro
 assert (javaRDD.filter(record -> record.getCurrentLocation() != 
null).collect().size() == 0);
   }
 
+  @ParameterizedTest
+  @EnumSource(value = IndexType.class, names = {"BLOOM", "GLOBAL_BLOOM", 
"SIMPLE", "GLOBAL_SIMPLE"})

Review comment:
   ok





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449947875



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
##
@@ -97,28 +98,9 @@ public Path makeNewPath(String partitionPath) {
*
* @param partitionPath Partition path
*/
-  protected void createMarkerFile(String partitionPath) {
-Path markerPath = makeNewMarkerPath(partitionPath);
-try {
-  LOG.info("Creating Marker Path=" + markerPath);
-  fs.create(markerPath, false).close();
-} catch (IOException e) {
-  throw new HoodieException("Failed to create marker file " + markerPath, 
e);
-}
-  }
-
-  /**
-   * THe marker path will be 
/.hoodie/.temp//2019/04/25/filename.
-   */
-  private Path makeNewMarkerPath(String partitionPath) {
-Path markerRootPath = new 
Path(hoodieTable.getMetaClient().getMarkerFolderPath(instantTime));
-Path path = FSUtils.getPartitionPath(markerRootPath, partitionPath);
-try {
-  fs.mkdirs(path); // create a new partition as needed.
-} catch (IOException e) {
-  throw new HoodieIOException("Failed to make dir " + path, e);
-}
-return new Path(path.toString(), FSUtils.makeMarkerFile(instantTime, 
writeToken, fileId));
+  protected void createMarkerFile(String partitionPath, String dataFileName) {
+MarkerFiles markerFiles = new MarkerFiles(hoodieTable, instantTime);
+markerFiles.createMarkerFile(partitionPath, dataFileName, getIOType());

Review comment:
   Hi, I think naming it createMarkerFile will be clearer.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449947358



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##
@@ -113,8 +109,9 @@ private void init(String fileId, String partitionPath, 
HoodieBaseFile dataFileTo
   partitionMetadata.trySave(getPartitionId());
 
   oldFilePath = new Path(config.getBasePath() + "/" + partitionPath + "/" 
+ latestValidFilePath);
+  String newFileName = FSUtils.makeDataFileName(instantTime, writeToken, 
fileId);

Review comment:
   yes, it will be more common use. HUDI will support more base format





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449946532



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
##
@@ -278,6 +286,11 @@ public WriteStatus getWriteStatus() {
 return writeStatus;
   }
 
+  @Override
+  public MarkerFiles.MarkerType getIOType() {

Review comment:
Renaming it to IOType would be better.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


lw309637554 commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449946413



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -332,9 +333,11 @@ public static SparkConf registerClasses(SparkConf conf) {
   }
 
   @Override
-  protected void postCommit(HoodieCommitMetadata metadata, String instantTime,
-  Option> extraMetadata) {
+  protected void postCommit(HoodieTable table, HoodieCommitMetadata 
metadata, String instantTime, Option> extraMetadata) {
 try {
+  if (!config.getRollBackUsingMarkers()) {

Review comment:
   I also think it is not ideal. I think we can avoid deleting marker files 
here: delete the unneeded marker files in pre-commit, and delete the 
old marker files during clean.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2020-07-05 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new d8f0d8e  Travis CI build asf-site
d8f0d8e is described below

commit d8f0d8e2fe38849892fe3657828fdb23ad6a4e19
Author: CI 
AuthorDate: Mon Jul 6 00:50:35 2020 +

Travis CI build asf-site
---
 content/cn/community.html | 6 +-
 content/community.html| 6 +-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/content/cn/community.html b/content/cn/community.html
index 335ae57..a4b8bdb 100644
--- a/content/cn/community.html
+++ b/content/cn/community.html
@@ -219,10 +219,14 @@
   
   
 
-  For any general questions, user support, development discussions
+  For development discussions
   Dev Mailing list (mailto:dev-subscr...@hudi.apache.org";>Subscribe, mailto:dev-unsubscr...@hudi.apache.org";>Unsubscribe, https://lists.apache.org/list.html?d...@hudi.apache.org";>Archives). 
Empty email works for subscribe/unsubscribe. Please use https://gist.github.com";>gists to share code/stacktraces on the 
email.
 
 
+  For any general questions, user support
+  Users Mailing list (mailto:users-subscr...@hudi.apache.org";>Subscribe, mailto:users-unsubscr...@hudi.apache.org";>Unsubscribe, https://lists.apache.org/list.html?us...@hudi.apache.org";>Archives). 
Empty email works for subscribe/unsubscribe. Please use https://gist.github.com";>gists to share code/stacktraces on the 
email.
+
+
   For reporting bugs or issues or discover known issues
   Please use https://issues.apache.org/jira/projects/HUDI/summary";>ASF Hudi JIRA. 
See #here for access
 
diff --git a/content/community.html b/content/community.html
index f4b6636..9ec95e4 100644
--- a/content/community.html
+++ b/content/community.html
@@ -219,10 +219,14 @@
   
   
 
-  For any general questions, user support, development discussions
+  For development discussions
   Dev Mailing list (mailto:dev-subscr...@hudi.apache.org";>Subscribe, mailto:dev-unsubscr...@hudi.apache.org";>Unsubscribe, https://lists.apache.org/list.html?d...@hudi.apache.org";>Archives). 
Empty email works for subscribe/unsubscribe. Please use https://gist.github.com";>gists to share code/stacktraces on the 
email.
 
 
+  For any general questions, user support
+  Users Mailing list (mailto:users-subscr...@hudi.apache.org";>Subscribe, mailto:users-unsubscr...@hudi.apache.org";>Unsubscribe, https://lists.apache.org/list.html?us...@hudi.apache.org";>Archives). 
Empty email works for subscribe/unsubscribe. Please use https://gist.github.com";>gists to share code/stacktraces on the 
email.
+
+
   For reporting bugs or issues or discover known issues
   Please use https://issues.apache.org/jira/projects/HUDI/summary";>ASF Hudi JIRA. 
See #here for access
 



[GitHub] [hudi] vinothchandar merged pull request #1796: [MINOR] Add the users@ mailing list to the community page

2020-07-05 Thread GitBox


vinothchandar merged pull request #1796:
URL: https://github.com/apache/hudi/pull/1796


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [MINOR] Add the users@ mailing list to the community page (#1796)

2020-07-05 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 823bb3e  [MINOR] Add the users@ mailing list to the community page 
(#1796)
823bb3e is described below

commit 823bb3e9429c2eaf1fc8021a4c6cca1914a783fb
Author: vinoyang 
AuthorDate: Mon Jul 6 08:48:40 2020 +0800

[MINOR] Add the users@ mailing list to the community page (#1796)
---
 docs/_pages/community.cn.md | 3 ++-
 docs/_pages/community.md| 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/_pages/community.cn.md b/docs/_pages/community.cn.md
index 6a424e1..b06dd9a 100644
--- a/docs/_pages/community.cn.md
+++ b/docs/_pages/community.cn.md
@@ -12,7 +12,8 @@ There are several ways to get in touch with the Hudi 
community.
 
 | When? | Channel to use |
 |---||
-| For any general questions, user support, development discussions | Dev 
Mailing list ([Subscribe](mailto:dev-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:dev-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?d...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
+| For development discussions | Dev Mailing list 
([Subscribe](mailto:dev-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:dev-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?d...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
+| For any general questions, user support | Users Mailing list 
([Subscribe](mailto:users-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:users-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?us...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
 | For reporting bugs or issues or discover known issues | Please use [ASF Hudi 
JIRA](https://issues.apache.org/jira/projects/HUDI/summary). See 
[#here](#accounts) for access |
 | For quick pings & 1-1 chats | Join our [slack 
group](https://join.slack.com/t/apache-hudi/signup). In case your mail domain 
is not there in pre-approved list for joining slack group, please check out the 
[github issue](https://github.com/apache/hudi/issues/143) |
 | For proposing large features, changes | Start a RFC. Instructions 
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).
diff --git a/docs/_pages/community.md b/docs/_pages/community.md
index b43d644..a3bc585 100644
--- a/docs/_pages/community.md
+++ b/docs/_pages/community.md
@@ -11,7 +11,8 @@ There are several ways to get in touch with the Hudi 
community.
 
 | When? | Channel to use |
 |---||
-| For any general questions, user support, development discussions | Dev 
Mailing list ([Subscribe](mailto:dev-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:dev-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?d...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
+| For development discussions | Dev Mailing list 
([Subscribe](mailto:dev-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:dev-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?d...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
+| For any general questions, user support | Users Mailing list 
([Subscribe](mailto:users-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:users-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?us...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
 | For reporting bugs or issues or discover known issues | Please use [ASF Hudi 
JIRA](https://issues.apache.org/jira/projects/HUDI/summary). See 
[#here](#accounts) for access |
 | For quick pings & 1-1 chats | Join our [slack 
group](https://join.slack.com/t/apache-hudi/signup). In case your mail domain 
is not there in pre-approved list for joining slack group, please check out the 
[github issue](https://github.com/apache/hudi/issues/143) |
 | For proposing large features, changes | Start a RFC. Instructions 
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).



[GitHub] [hudi] vinothchandar commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-05 Thread GitBox


vinothchandar commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r449786823



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##
@@ -113,8 +109,9 @@ private void init(String fileId, String partitionPath, 
HoodieBaseFile dataFileTo
   partitionMetadata.trySave(getPartitionId());
 
   oldFilePath = new Path(config.getBasePath() + "/" + partitionPath + "/" 
+ latestValidFilePath);
+  String newFileName = FSUtils.makeDataFileName(instantTime, writeToken, 
fileId);

Review comment:
   probably need to ensure we are getting the base file format extension 
from the hoodieTable instance? 

##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/rollback/TestMergeOnReadRollbackActionExecutor.java
##
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.rollback;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieRollbackStat;
+
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+import org.apache.hudi.testutils.HoodieTestDataGenerator;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static 
org.apache.hudi.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH;
+import static 
org.apache.hudi.testutils.HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+
+public class TestMergeOnReadRollbackActionExecutor extends 
HoodieClientTestBase {
+  @Override
+  protected HoodieTableType getTableType() {
+return HoodieTableType.MERGE_ON_READ;
+  }
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initPath();
+initSparkContexts();
+// just generate two partitions
+dataGen = new HoodieTestDataGenerator(new 
String[]{DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH});
+initFileSystem();
+initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+cleanupResources();
+  }
+
+  private void twoUpsertCommitDataRollBack(boolean isUsingMarkers) throws 
IOException, InterruptedException {

Review comment:
   Any way to share code with the COW test?

##
File path: hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
##
@@ -328,6 +330,18 @@ public void 
testSimpleTagLocationAndUpdateWithRollback(IndexType indexType) thro
 assert (javaRDD.filter(record -> record.getCurrentLocation() != 
null).collect().size() == 0);
   }
 
+  @ParameterizedTest
+  @EnumSource(value = IndexType.class, names = {"BLOOM", "GLOBAL_BLOOM", 
"SIMPLE", "GLOBAL_SIMPLE"})

Review comment:
   Similar here... this test probably does not need to test these two modes? 
For example, what additional coverage are we getting over the schema evolution 
test by doing this?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java
##
@@ -0,0 +1,161 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.

[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-07-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-860:
-
Description: 
As of now, in the upsert path,
 * Hudi builds a workloadProfile to understand the total inserts and updates 
(with location info)
 * Following that, small-file info is populated
 * Then buckets are populated with the above info
 * These buckets are later used when getPartition(Object key) is invoked in 
UpsertPartitioner

In step 1, to build the global workload profile, we have to run an action on the 
entire JavaRDDs in the driver, and Hudi saves the workload profile as 
well.

For large, write-intensive batch jobs (COW tables), caching this incurs additional 
overhead. So this effort is trying to see if we can avoid doing it by some 
means.
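
To make the caching concern concrete, here is a toy sketch (not Hudi's actual 
WorkloadProfile or UpsertPartitioner code) of the kind of driver-side action step 1 
implies; the Record case class and its fields are invented for illustration only.

// Illustration only: a toy "workload profile" over incoming records.
// `Record`, `partitionPath`, and `isUpdate` are placeholders, not Hudi classes.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

case class Record(partitionPath: String, isUpdate: Boolean)

def buildToyProfile(records: RDD[Record]): Map[String, (Long, Long)] = {
  // Caching before the action avoids recomputing the upstream lineage
  // when the records are scanned again during the actual write.
  records.persist(StorageLevel.MEMORY_AND_DISK)
  records
    .map(r => (r.partitionPath, if (r.isUpdate) (0L, 1L) else (1L, 0L)))
    .reduceByKey { case ((i1, u1), (i2, u2)) => (i1 + i2, u1 + u2) }
    .collectAsMap()                   // the driver-side action mentioned above
    .toMap
}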

 

 

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> As of now, in upsert path,
>  * hudi builds a workloadProfile to understand total inserts and updates(with 
> location info) 
>  * Following which, small files info are populated
>  * Then buckets are populated with above info. 
>  * These buckets are later used when getPartition(Object key) is invoked in 
> UpsertPartitioner.
> In step1: to build global workload profile, we had to do an action on entire 
> JavaRDDs in the driver and hudi does save the workload profile 
> as well. 
> For large write intensive batch jobs(COW types), caching this incurs 
> additional overhead. So, this effort is trying to see if we can avoid doing 
> this by some means. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] garyli1019 commented on a change in pull request #1702: Bootstrap datasource changes

2020-07-05 Thread GitBox


garyli1019 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r434962196



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiBootstrapRelation.scala
##
@@ -0,0 +1,185 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+/**
+  * This is Spark relation that can be used for querying metadata/fully 
bootstrapped query hudi tables, as well as
+  * non-bootstrapped tables. It implements PrunedFilteredScan interface in 
order to support column pruning and filter
+  * push-down. For metadata bootstrapped files, if we query columns from both 
metadata and actual data then it will
+  * perform a merge of both to return the result.
+  *
+  * Caveat: Filter push-down does not work when querying both metadata and 
actual data columns over metadata
+  * bootstrapped files, because then the metadata file and data file can 
return different number of rows causing errors
+  * merging.
+  *
+  * @param _sqlContext Spark SQL Context
+  * @param userSchema User specified schema in the datasource query
+  * @param globPaths Globbed paths obtained from the user provided path for 
querying
+  * @param metaClient Hudi table meta client
+  * @param optParams DataSource options passed by the user
+  */
+class HudiBootstrapRelation(@transient val _sqlContext: SQLContext,
+val userSchema: StructType,
+val globPaths: Seq[Path],
+val metaClient: HoodieTableMetaClient,
+val optParams: Map[String, String]) extends 
BaseRelation
+  with PrunedFilteredScan with Logging {
+
+  val skeletonSchema: StructType = HudiSparkUtils.getHudiMetadataSchema
+  var dataSchema: StructType = _
+  var fullSchema: StructType = _
+
+  val fileIndex: HudiBootstrapFileIndex = buildFileIndex()
+
+  override def sqlContext: SQLContext = _sqlContext
+
+  override val needConversion: Boolean = false
+
+  override def schema: StructType = inferFullSchema()
+
+  override def buildScan(requiredColumns: Array[String], filters: 
Array[Filter]): RDD[Row] = {
+logInfo("Starting scan..")
+
+// Compute splits
+val bootstrapSplits = fileIndex.files.map(hoodieBaseFile => {
+  var skeletonFile: Option[PartitionedFile] = Option.empty
+  var dataFile: PartitionedFile = null
+
+  if (hoodieBaseFile.getExternalBaseFile.isPresent) {
+skeletonFile = Option(PartitionedFile(InternalRow.empty, 
hoodieBaseFile.getPath, 0, hoodieBaseFile.getFileLen))
+dataFile = PartitionedFile(InternalRow.empty, 
hoodieBaseFile.getExternalBaseFile.get().getPath, 0,
+  hoodieBaseFile.getExternalBaseFile.get().getFileLen)
+  } else {
+dataFile = PartitionedFile(InternalRow.empty, hoodieBaseFile.getPath, 
0, hoodieBaseFile.getFileLen)
+  }
+  HudiBootstrapSplit(dataFile, skeletonFile)
+})
+val tableState = HudiBootstrapTableState(bootstrapSplits)
+
+// Get required schemas for column pruning
+var requiredDataSchema = StructType(Seq())
+var requiredSkeletonSchema = StructType(Seq())
+requiredColumns.foreach(col => {
+  var field = dataSchema.find(_.name == col)
+  if (field.isDefined) {
+requiredDataSchema = requiredDataSchema.add(field.get)
+  } else {
+field = skeletonSchema.find(_.name == col)
+requiredSkeletonSchema = requiredSkeletonSchem

[GitHub] [hudi] garyli1019 commented on a change in pull request #1702: Bootstrap datasource changes

2020-07-05 Thread GitBox


garyli1019 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r449933225



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HudiBootstrapRDD.scala
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+class HudiBootstrapRDD(@transient spark: SparkSession,
+   dataReadFunction: PartitionedFile => Iterator[Any],
+   skeletonReadFunction: PartitionedFile => Iterator[Any],
+   regularReadFunction: PartitionedFile => Iterator[Any],
+   dataSchema: StructType,
+   skeletonSchema: StructType,
+   requiredColumns: Array[String],
+   tableState: HudiBootstrapTableState)
+  extends RDD[InternalRow](spark.sparkContext, Nil) {
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val bootstrapPartition = split.asInstanceOf[HudiBootstrapPartition]
+
+if (log.isDebugEnabled) {
+  if (bootstrapPartition.split.skeletonFile.isDefined) {
+logDebug("Got Split => Index: " + bootstrapPartition.index + ", Data 
File: "
+  + bootstrapPartition.split.dataFile.filePath + ", Skeleton File: "
+  + bootstrapPartition.split.skeletonFile.get.filePath)
+  } else {
+logDebug("Got Split => Index: " + bootstrapPartition.index + ", Data 
File: "
+  + bootstrapPartition.split.dataFile.filePath)
+  }
+}
+
+var partitionedFileIterator: Iterator[InternalRow] = null
+
+if (bootstrapPartition.split.skeletonFile.isDefined) {
+  // It is a bootstrap split. Check both skeleton and data files.
+  if (dataSchema.isEmpty) {
+// No data column to fetch, hence fetch only from skeleton file
+partitionedFileIterator = 
read(bootstrapPartition.split.skeletonFile.get,  skeletonReadFunction)
+  } else if (skeletonSchema.isEmpty) {
+// No metadata column to fetch, hence fetch only from data file
+partitionedFileIterator = read(bootstrapPartition.split.dataFile, 
dataReadFunction)
+  } else {
+// Fetch from both data and skeleton file, and merge
+val dataFileIterator = read(bootstrapPartition.split.dataFile, 
dataReadFunction)
+val skeletonFileIterator = 
read(bootstrapPartition.split.skeletonFile.get, skeletonReadFunction)
+partitionedFileIterator = merge(skeletonFileIterator, dataFileIterator)
+  }
+} else {
+  partitionedFileIterator = read(bootstrapPartition.split.dataFile, 
regularReadFunction)
+}
+partitionedFileIterator
+  }
+
+  def merge(skeletonFileIterator: Iterator[InternalRow], dataFileIterator: 
Iterator[InternalRow])

Review comment:
   I think this approach is better than extending the `FileFormat`. 
Ultimately, we can have a `HudiRDD` to handle all the file loading and 
merging (bootstrap files, parquet, orc, logs). A `union` will trigger a shuffle, and 
grouping files on the driver and then using a different `FileFormat` to read is not as 
clean as this approach. 
   I will add the `MOR` support on top of this PR after it is merged.
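
   Since the body of `merge` is cut off above, here is a simplified sketch of the 
row-by-row zip it describes. It uses plain Scala sequences instead of Spark's 
InternalRow, and the column values are invented, so it only illustrates the merge 
idea, not the PR's actual implementation.

   // Minimal sketch: merge a skeleton-file iterator with a data-file iterator
   // by zipping them row by row.
   def mergeRows(skeletonRows: Iterator[Seq[Any]],
                 dataRows: Iterator[Seq[Any]]): Iterator[Seq[Any]] = {
     // Both files are assumed to hold the same rows in the same order; if they
     // diverge, the merged result is wrong (the filter push-down caveat noted
     // in HudiBootstrapRelation's scaladoc).
     skeletonRows.zip(dataRows).map { case (meta, data) => meta ++ data }
   }

   // Usage with toy data: metadata columns followed by data columns.
   val merged = mergeRows(
     Iterator(Seq("commitTime-001", "recordKey-1"), Seq("commitTime-001", "recordKey-2")),
     Iterator(Seq("alice", 34), Seq("bob", 41))
   )
   merged.foreach(println)  // List(commitTime-001, recordKey-1, alice, 34) ...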





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-860) Ability to do small file handling without need for caching

2020-07-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-860:


Assignee: sivabalan narayanan  (was: Vinoth Chandar)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-07-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-860:
-
Status: Open  (was: New)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-07-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-860:
-
Status: In Progress  (was: Open)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated (574dcf9 -> 3b9a305)

2020-07-05 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 574dcf9  [MINOR] Relocate jetty during shading/packaging for 
Databricks runtime (#1781)
 add 3b9a305  [HUDI-996] Add functional test suite for hudi-utilities 
(#1746)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/testutils/DFSProvider.java |  34 +++
 .../hudi/testutils/FunctionalTestHarness.java  | 122 
 .../org/apache/hudi/testutils/SparkProvider.java   |  55 
 .../testutils/minicluster/HdfsTestService.java |   4 +-
 hudi-utilities/pom.xml |  18 ++
 .../hudi/utilities/TestHoodieSnapshotExporter.java |  57 
 .../TestKafkaConnectHdfsProvider.java  |  30 +-
 .../TestAWSDatabaseMigrationServiceSource.java |  31 +-
 .../TestChainedTransformer.java|  41 +--
 .../functional/TestHDFSParquetImporter.java| 340 +
 .../functional/TestHoodieSnapshotCopier.java   |  36 +--
 .../functional/TestHoodieSnapshotExporter.java | 125 +++-
 .../functional/TestJdbcbasedSchemaProvider.java|  24 +-
 .../functional/UtilitiesFunctionalTestSuite.java   |  32 ++
 .../transform/TestChainedTransformer.java  |  52 
 pom.xml|  26 +-
 style/checkstyle.xml   |   2 +-
 17 files changed, 580 insertions(+), 449 deletions(-)
 create mode 100644 
hudi-client/src/test/java/org/apache/hudi/testutils/DFSProvider.java
 create mode 100644 
hudi-client/src/test/java/org/apache/hudi/testutils/FunctionalTestHarness.java
 create mode 100644 
hudi-client/src/test/java/org/apache/hudi/testutils/SparkProvider.java
 create mode 100644 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieSnapshotExporter.java
 copy hudi-utilities/src/test/java/org/apache/hudi/utilities/{transform => 
functional}/TestChainedTransformer.java (62%)
 create mode 100644 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/UtilitiesFunctionalTestSuite.java



[GitHub] [hudi] vinothchandar merged pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-07-05 Thread GitBox


vinothchandar merged pull request #1746:
URL: https://github.com/apache/hudi/pull/1746


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-07-05 Thread GitBox


vinothchandar commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r449931988



##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/FunctionalTestHarness.java
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.IOException;
+
+public class FunctionalTestHarness implements SparkProvider, DFSProvider {
+
+  private static transient SparkSession spark;
+  private static transient SQLContext sqlContext;
+  private static transient JavaSparkContext jsc;
+
+  private static transient HdfsTestService hdfsTestService;
+  private static transient MiniDFSCluster dfsCluster;
+  private static transient DistributedFileSystem dfs;
+
+  /**
+   * An indicator of the initialization status.
+   */
+  protected boolean initialized = false;

Review comment:
   It's a little confusing.. but okay to be fixed later.

##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/FunctionalTestHarness.java
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.IOException;
+
+public class FunctionalTestHarness implements SparkProvider, DFSProvider {
+
+  private static transient SparkSession spark;
+  private static transient SQLContext sqlContext;
+  private static transient JavaSparkContext jsc;
+
+  private static transient HdfsTestService hdfsTestService;
+  private static transient MiniDFSCluster dfsCluster;
+  private static transient DistributedFileSystem dfs;
+
+  /**
+   * An indicator of the initialization status.
+   */
+  protected boolean initialized = false;

Review comment:
   Okay, `initialized = spark != null && hdfsTestService != null;` is what 
makes spark and hdfsTestService singletons across a run.
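
   For illustration, a minimal sketch of how that guard could sit in the JUnit 5
   setup method (the method name and body are assumptions, not necessarily the
   exact code in this PR; it assumes HdfsTestService#start(boolean) returns the
   MiniDFSCluster, and reuses the fields declared in the hunk above):

     @BeforeEach
     public synchronized void runBeforeEach() throws Exception {
       // Reuse the Spark session and mini HDFS cluster started by an earlier test
       // class in the same JVM; bootstrap them only on the first run.
       initialized = spark != null && hdfsTestService != null;
       if (!initialized) {
         spark = SparkSession.builder()
             .appName("hudi-functional-tests")
             .master("local[4]")
             .getOrCreate();
         sqlContext = spark.sqlContext();
         jsc = new JavaSparkContext(spark.sparkContext());

         hdfsTestService = new HdfsTestService();
         dfsCluster = hdfsTestService.start(true);
         dfs = dfsCluster.getFileSystem();
       }
     }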





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] WilliamWhispell commented on issue #998: Incremental view not implemented yet, for merge-on-read datasets

2020-07-05 Thread GitBox


WilliamWhispell commented on issue #998:
URL: https://github.com/apache/hudi/issues/998#issuecomment-653927435


   @n3nash - From your comment on 11-08-2019 this would be implemented soon, 
but I can't tell from https://issues.apache.org/jira/browse/HUDI-58 if it is 
now implemented. Any ETA on when this will be supported?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua opened a new pull request #1796: [MINOR] Add the users@ mailing list to the community page

2020-07-05 Thread GitBox


yanghua opened a new pull request #1796:
URL: https://github.com/apache/hudi/pull/1796


   
   
   ## What is the purpose of the pull request
   
   *Add the users@ mailing list to the community page*
   
   ## Brief change log
   
 - *Add the users@ mailing list to the community page*
   
   ## Verify this pull request
   
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1769: [DOC] Add document for the use of metrics system in Hudi.

2020-07-05 Thread GitBox


shenh062326 commented on pull request #1769:
URL: https://github.com/apache/hudi/pull/1769#issuecomment-653892441


   @leesf Yes, I will add document for Datadog.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1769: [DOC] Add document for the use of metrics system in Hudi.

2020-07-05 Thread GitBox


leesf commented on pull request #1769:
URL: https://github.com/apache/hudi/pull/1769#issuecomment-653891474


   > > @shenh062326 Thanks for you contributing! would you please clarify why 
put the metrics in a separate section?
   > 
   > @leesf This was discussed in [#1672 
(comment)](https://github.com/apache/hudi/pull/1672#issuecomment-634841513)
   
   @xushiyan ack, seems like we would also move Datadog here?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on issue #1795: Question about ACID support .

2020-07-05 Thread GitBox


leesf commented on issue #1795:
URL: https://github.com/apache/hudi/issues/1795#issuecomment-653891062


Hudi enables ACID on both S3 and HDFS. You can find some info about ACID in 
these talks: 
http://hudi.apache.org/docs/powered_by.html#talks--presentations, and we will 
highlight this feature on the website.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-384) Treat compaction as commit action internally in Hudi to avoid special handling during state transitions

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-384:
---
Labels: help-wanted  (was: )

> Treat compaction as commit action internally in Hudi to avoid special 
> handling during state transitions 
> 
>
> Key: HUDI-384
> URL: https://issues.apache.org/jira/browse/HUDI-384
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Minor
>  Labels: help-wanted
>
> Link : 
> [https://github.com/apache/incubator-hudi/pull/1009#discussion_r348089546]
> Came up during code-review. 
> ```
> seems most of the issue.stems from this switching of compaction => commit? 
> Just throwing out an idea to see if we can just call talk about compaction as 
> an implementation, but have the action be just commit? i.e remove Compaction 
> action and replace with Commit given we have requested and inflight there? 
> would that simplify the design? does it open new migration pains?
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-157) Allow more than 1 compaction to be run concurrently in deltastreamer after MOR Incremental read is fully supported

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-157:
---
Labels: help-wanted  (was: )

> Allow more than 1 compaction to be run concurrently in deltastreamer after 
> MOR Incremental read is fully supported
> --
>
> Key: HUDI-157
> URL: https://issues.apache.org/jira/browse/HUDI-157
> Project: Apache Hudi
>  Issue Type: Task
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-wanted
>
> Only 1 compaction is run by deltastreamer. Once incremental MOR  is 
> supported, we can allow concurrent compaction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-39) Scope out how hudi can be integrated underneath Gobblin.. #407

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-39?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-39:
--
Labels: help-wanted  (was: )

> Scope out how hudi can be integrated underneath Gobblin.. #407
> --
>
> Key: HUDI-39
> URL: https://issues.apache.org/jira/browse/HUDI-39
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> https://github.com/uber/hudi/issues/407



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-59) Incremental pull should set the _hoodie_commit_time automatically from configs in HoodieInputFormat #16

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-59:
--
Labels: help-wanted  (was: )

> Incremental pull should set the _hoodie_commit_time automatically from 
> configs in HoodieInputFormat #16
> ---
>
> Key: HUDI-59
> URL: https://issues.apache.org/jira/browse/HUDI-59
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> https://github.com/uber/hudi/issues/16



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-323) Docker demo/integ-test stdout/stderr output only available on process exit

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-323:
---
Labels: help-wanted newbie  (was: )

> Docker demo/integ-test stdout/stderr output only available on process exit
> --
>
> Key: HUDI-323
> URL: https://issues.apache.org/jira/browse/HUDI-323
> Project: Apache Hudi
>  Issue Type: Test
>  Components: newbie, Testing
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted, newbie
>
> This is problematic: if the command hangs, for e.g., we don't get any logs to 
> indicate where it is hanging.. A better approach is to use something like 
> Piped{Input|Output}Stream to implement a thread that can keep printing stdout 
> and stderr as the command runs..  
> Relevant classes : ITTestHoodieDemo / ITTestBase 
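
For illustration, a minimal sketch of the draining-thread idea (hypothetical helper,
not the current ITTestBase code):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.io.PrintStream;

  class StreamDrainerSketch {
    // Drain an InputStream on a background thread so stdout/stderr are printed
    // while the command is still running, instead of only on process exit.
    static Thread drain(InputStream in, PrintStream out) {
      Thread t = new Thread(() -> {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          String line;
          while ((line = reader.readLine()) != null) {
            out.println(line);
          }
        } catch (IOException e) {
          e.printStackTrace(out);
        }
      });
      t.setDaemon(true);
      t.start();
      return t;
    }

    public static void main(String[] args) throws Exception {
      Process p = new ProcessBuilder(args).start();
      drain(p.getInputStream(), System.out);
      drain(p.getErrorStream(), System.err);
      System.exit(p.waitFor());
    }
  }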



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-71) HoodieRealtimeInputFormat does not apply application specific merge hooks #173

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-71:
--
Labels: help-wanted starter  (was: )

> HoodieRealtimeInputFormat does not apply application specific merge hooks #173
> --
>
> Key: HUDI-71
> URL: https://issues.apache.org/jira/browse/HUDI-71
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted, starter
>
> https://github.com/uber/hudi/issues/173



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-102) Beeline/Hive Client - select * on real-time views fails with schema related errors for tables with deep-nested schema #439

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-102:
---
Labels: help-wanted  (was: )

> Beeline/Hive Client - select * on real-time views fails with schema related 
> errors for tables with deep-nested schema #439
> --
>
> Key: HUDI-102
> URL: https://issues.apache.org/jira/browse/HUDI-102
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> https://github.com/apache/incubator-hudi/issues/439



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-465) Make Hive Sync via Spark painless

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-465:
---
Labels: help-wanted starter  (was: )

> Make Hive Sync via Spark painless
> -
>
> Key: HUDI-465
> URL: https://issues.apache.org/jira/browse/HUDI-465
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration, Spark Integration, Usability
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted, starter
>
> Currently, we require many configs to be passed in for the Hive sync.. this 
> has to be simplified and experience should be close to how regular 
> spark.write.parquet registers into Hive.. 
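
For illustration, a rough sketch of the current verbosity versus the desired
experience (df and basePath are placeholders; option keys are the Hudi datasource
Hive-sync keys, values are examples only):

  // Today: every Hive-sync knob has to be spelled out on the writer.
  df.write().format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "ds")
      .option("hoodie.datasource.hive_sync.enable", "true")
      .option("hoodie.datasource.hive_sync.database", "default")
      .option("hoodie.datasource.hive_sync.table", "my_table")
      .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://hiveserver:10000")
      .option("hoodie.datasource.hive_sync.partition_fields", "ds")
      .mode(SaveMode.Append)
      .save(basePath);

  // Desired experience, closer to how plain Spark registers a table in Hive:
  df.write().parquet(basePath);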



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-274) Consolidate all scripts under top level scripts directory

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-274:
---
Labels: starter  (was: )

> Consolidate all scripts under top level scripts directory
> -
>
> Key: HUDI-274
> URL: https://issues.apache.org/jira/browse/HUDI-274
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: starter
>
> Before we do this, let us revisit one more time if this is ideal. It has 
> pros/cons. Moving to one place makes it easy to find, but the scripts would then 
> have to assume the inter-directory structure. Also, each sub-module is no longer 
> entirely self-contained, as its script lives in a different place.
> This came up in a code-review discussion : 
> https://github.com/apache/incubator-hudi/pull/918#discussion_r327904862
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-284) Need Tests for Hudi handling of schema evolution

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-284:
---
Labels: help-requested starter  (was: help-requested)

> Need  Tests for Hudi handling of schema evolution
> -
>
> Key: HUDI-284
> URL: https://issues.apache.org/jira/browse/HUDI-284
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Common Core, newbie, Testing
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested, starter
>
> Context in : 
> https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-854) Incremental Cleaning should not revert to brute force all-partition scanning in any cases

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-854:
---
Labels: help-requested starter  (was: help-requested)

> Incremental Cleaning should not revert to brute force all-partition scanning 
> in any cases
> -
>
> Key: HUDI-854
> URL: https://issues.apache.org/jira/browse/HUDI-854
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested, starter
>
> After [https://github.com/apache/incubator-hudi/pull/1576] . Incremental 
> Cleaning would still resort to full partition scan when  no previous clean 
> operation was done in the dataset. This ticket is to design and implement a 
> safe solution which would avoid full scanning in all cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-52) Implement Savepoints for Merge On Read table #88

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-52:
--
Labels: help-requested starter  (was: help-requested)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, starter
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-45) Refactor handleWrite() in HoodieMergeHandle to offload conversion and merging of records to reader #374

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-45:
--
Labels: help-requested starter  (was: help-requested)

> Refactor handleWrite() in HoodieMergeHandle to offload conversion and merging 
> of records to reader #374
> ---
>
> Key: HUDI-45
> URL: https://issues.apache.org/jira/browse/HUDI-45
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, starter
>
> https://github.com/uber/hudi/issues/374



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-48) Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-48:
--
Labels: help-requested starter  (was: help-requested)

> Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339
> -
>
> Key: HUDI-48
> URL: https://issues.apache.org/jira/browse/HUDI-48
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup, Compaction, Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, starter
>
> https://github.com/uber/hudi/issues/339



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-37) Persist the HoodieIndex type in the hoodie.properties file #409

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-37:
--
Labels: help-requested newbie starter  (was: help-requested)

> Persist the HoodieIndex type in the hoodie.properties file #409
> ---
>
> Key: HUDI-37
> URL: https://issues.apache.org/jira/browse/HUDI-37
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, newbie, starter
>
> https://github.com/uber/hudi/issues/409



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-47) Revisit null checks in the Log Blocks, merge lazyreading with this null check #340

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-47:
--
Labels: help-requested newbie starter  (was: help-requested)

> Revisit null checks in the Log Blocks, merge lazyreading with this null check 
> #340
> --
>
> Key: HUDI-47
> URL: https://issues.apache.org/jira/browse/HUDI-47
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Storage Management
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, newbie, starter
>
> https://github.com/uber/hudi/issues/340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-865) Improve Hive Syncing by directly translating avro schema to Hive types

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-865:
---
Labels: starter  (was: )

> Improve Hive Syncing by directly translating avro schema to Hive types
> --
>
> Key: HUDI-865
> URL: https://issues.apache.org/jira/browse/HUDI-865
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: starter
>
> With the current code in master and the proposed improvements in 
> https://github.com/apache/incubator-hudi/pull/1559,
> Hive Sync integration would resort to the following translation for finding the 
> table schema:
>  Avro schema -> Parquet schema -> Hive schema
> We need to implement logic to skip the extra hop to the Parquet schema when 
> generating the Hive schema. 
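
For illustration, a rough sketch (hypothetical helper, not the actual Hudi code) of
mapping an Avro schema directly to Hive column types, skipping the Parquet hop:

  import org.apache.avro.Schema;

  class AvroToHiveTypeSketch {
    static String toHiveType(Schema schema) {
      switch (schema.getType()) {
        case BOOLEAN: return "boolean";
        case INT:     return "int";
        case LONG:    return "bigint";
        case FLOAT:   return "float";
        case DOUBLE:  return "double";
        case STRING:
        case ENUM:    return "string";
        case BYTES:
        case FIXED:   return "binary";
        case ARRAY:   return "array<" + toHiveType(schema.getElementType()) + ">";
        case MAP:     return "map<string," + toHiveType(schema.getValueType()) + ">";
        case UNION:   // unions here are typically nullable wrappers: use the non-null branch
          return schema.getTypes().stream()
              .filter(s -> s.getType() != Schema.Type.NULL)
              .findFirst().map(AvroToHiveTypeSketch::toHiveType).orElse("string");
        case RECORD: {
          StringBuilder sb = new StringBuilder("struct<");
          for (int i = 0; i < schema.getFields().size(); i++) {
            Schema.Field f = schema.getFields().get(i);
            if (i > 0) {
              sb.append(",");
            }
            sb.append(f.name()).append(":").append(toHiveType(f.schema()));
          }
          return sb.append(">").toString();
        }
        default:
          throw new IllegalArgumentException("Unsupported Avro type: " + schema.getType());
      }
    }
  }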



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-871) Add support for Tencent cloud COS

2020-07-05 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151576#comment-17151576
 ] 

leesf commented on HUDI-871:


hi [~felixzheng], do you have time to send a PR?

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: newbie, starter
>
> Tencent cloud COS is becoming a widely used Object Storage Service, more and 
> more users use COS as the backend storage system, therefore this ticket 
> proposes to add support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-871) Add support for Tencent cloud COS

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-871:
---
Labels: newbie starter  (was: )

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: newbie, starter
>
> Tencent cloud COS is becoming a widely used Object Storage Service, more and 
> more users use COS as the backend storage system, therefore this ticket 
> proposes to add support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-984) Support Hive 1.x out of box

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-984:
---
Labels: help-requested starter  (was: )

> Support Hive 1.x out of box
> ---
>
> Key: HUDI-984
> URL: https://issues.apache.org/jira/browse/HUDI-984
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested, starter
> Fix For: 0.6.0
>
>
> With 0.5.0, Hudi is using 2.x as part of its compile time dependency and 
> works with Hive 2.x servers out of the box.
> We need similar support for Hive 1.x as it is still being used.
> 1. Hive 1.x servers can run queries with Hudi table
> 2. Hive Sync must happen successfully between Hudi tables and Hive 1.x 
> services
>  
> Important Note: Hive 1.x has 2 classes of versions:
>  # pre 1.2.0
>  # 1.2.0 and later
> We had earlier found out that those 2 classes are not compatible with each 
> other unfortunately. CDH version of Hive used to have pre 1.2.0. We need to 
> look at the feasibility, cost and impact of supporting of one or more of this 
> class.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-07-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-635:
---
Labels: help-requested newbie starter  (was: help-requested)

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested, newbie, starter
> Fix For: 0.6.0
>
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] dthauvin opened a new issue #1795: Question about ACID support .

2020-07-05 Thread GitBox


dthauvin opened a new issue #1795:
URL: https://github.com/apache/hudi/issues/1795


   Hi,
   does Apache Hudi enable ACID transactions on the same table with separate 
spark/hive/presto clusters on S3 or HDFS? 
   I am not able to find this kind of information in the documentation. 
   
   Thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] RajasekarSribalan opened a new issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

2020-07-05 Thread GitBox


RajasekarSribalan opened a new issue #1794:
URL: https://github.com/apache/hudi/issues/1794


   **Describe the problem you faced**
   
   Hi, we are doing upserts and deletes in Hudi COW tables. It is a Spark 
Streaming app which reads data from Kafka and upserts it into Hudi. Below is the 
pseudocode (a rough code sketch of these steps follows the list):
   
   1. var df = read kafka
   2. df.persist() // we persist the dataframe because a single dataframe can have 
both upsert and delete records, so we filter them based on U or D
   3. Filter only the upsert records and upsert them into Hudi
   4. Filter only the delete records and delete them from Hudi
   5. df.unpersist()
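
   A rough sketch of those steps in Spark (Java API), assuming an "Op" column carries 
the U/D flag and that tableName/basePath are placeholders (not the exact job; Hudi's 
datasource also needs record key/precombine options not shown here):

     // imports assumed: org.apache.spark.sql.{Dataset, Row, SaveMode, functions},
     // org.apache.spark.storage.StorageLevel
     Dataset<Row> df = readMicroBatchFromKafka();        // hypothetical helper, step 1
     df.persist(StorageLevel.MEMORY_AND_DISK());         // step 2

     Dataset<Row> upserts = df.filter(functions.col("Op").equalTo("U"));
     Dataset<Row> deletes = df.filter(functions.col("Op").equalTo("D"));

     upserts.write().format("org.apache.hudi")           // step 3: upsert into Hudi
         .option("hoodie.table.name", tableName)
         .option("hoodie.datasource.write.operation", "upsert")
         .mode(SaveMode.Append)
         .save(basePath);

     deletes.write().format("org.apache.hudi")           // step 4: delete from Hudi
         .option("hoodie.table.name", tableName)
         .option("hoodie.datasource.write.operation", "delete")
         .mode(SaveMode.Append)
         .save(basePath);

     df.unpersist();                                     // step 5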
   
   While doing the delete, it throws the error below. My question: do we need 
to sync with Hive even for the delete operation? Please confirm.
   
   20/07/05 10:19:20 ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   java.lang.IllegalArgumentException: Could not find any data file written for commit 
[20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, 
Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
     "partitionToRollingStats" : {
       "" : {
         "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
           "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
           "inserts" : 989333,
           "upserts" : 11,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49443028
         },
         "eed1f67c-8c46-425f-b740-2e21b84c6f13-0" : {
           "fileId" : "eed1f67c-8c46-425f-b740-2e21b84c6f13-0",
           "inserts" : 1263360,
           "upserts" : 16,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49672386
         },
         "e9f38e55-acf2-4bd2-b568-def7361f2f29-0" : {
           "fileId" : "e9f38e55-acf2-4bd2-b568-def7361f2f29-0",
           "inserts" : 946616,
           "upserts" : 6,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 45686395
         },
         "8a93afac-d60e-41bb-a3e1-edd793e2a932-0" : {
           "fileId" : "8a93afac-d60e-41bb-a3e1-edd793e2a932-0",
           "inserts" : 482202,
           "upserts" : 0,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49744729
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * Spark version : Cloudera Spark 2.2.0
   
   * Hive version : Cloudera Hive 1.1
   
   * Hadoop version :2.6
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :No
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org