[GitHub] [hudi] vinothchandar commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-07-20 Thread GitBox


vinothchandar commented on a change in pull request #1100:
URL: https://github.com/apache/hudi/pull/1100#discussion_r457851643



##
File path: hudi-hadoop-mr/pom.xml
##
@@ -125,6 +125,10 @@
   mockito-junit-jupiter
   test
 
+
+  org.mockito
+  mockito-junit-jupiter

Review comment:
   yes.. let's remove this.. maybe a rebase/merge thing? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-839.
-
Resolution: Fixed

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-839:
-

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-839:

Status: Closed  (was: Patch Available)

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar merged pull request #1756: [HUDI-839] Introducing support for rollbacks using marker files

2020-07-20 Thread GitBox


vinothchandar merged pull request #1756:
URL: https://github.com/apache/hudi/pull/1756


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-871) Add support for Tencent cloud COS

2020-07-20 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161718#comment-17161718
 ] 

leesf edited comment on HUDI-871 at 7/21/20, 5:35 AM:
--

[~meimile] Sure, assigned to you and feel free to open a new PR.


was (Author: xleesf):
[~meimile] Sure, feel free to open a new PR.

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Assignee: deyzhong
>Priority: Major
>  Labels: newbie, starter
>
> Tencent Cloud COS is becoming a widely used object storage service, and more 
> and more users use COS as their backend storage system; this ticket therefore 
> proposes adding support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-871) Add support for Tencent cloud COS

2020-07-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-871:
--

Assignee: deyzhong

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Assignee: deyzhong
>Priority: Major
>  Labels: newbie, starter
>
> Tencent Cloud COS is becoming a widely used object storage service, and more 
> and more users use COS as their backend storage system; this ticket therefore 
> proposes adding support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-871) Add support for Tencent cloud COS

2020-07-20 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161718#comment-17161718
 ] 

leesf commented on HUDI-871:


[~meimile] Sure, feel free to open a new PR.

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: newbie, starter
>
> Tencent Cloud COS is becoming a widely used object storage service, and more 
> and more users use COS as their backend storage system; this ticket therefore 
> proposes adding support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-20 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-661642358


   @vinothchandar @umehrot2 Ready for review. Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-20 Thread GitBox


garyli1019 commented on a change in pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#discussion_r457844997



##
File path: hudi-spark/src/main/scala/org/apache/hudi/SnapshotRelation.scala
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.JobConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, TableScan}
+import org.apache.spark.sql.types.StructType
+
+import scala.collection.JavaConverters._
+
+case class HudiMergeOnReadFileSplit(dataFile: PartitionedFile,
+                                    logPaths: Option[List[String]],
+                                    latestCommit: String,
+                                    tablePath: String,
+                                    maxCompactionMemoryInBytes: Long,
+                                    skipMerge: Boolean)
+
+class SnapshotRelation (val sqlContext: SQLContext,
+                        val optParams: Map[String, String],
+                        val userSchema: StructType,
+                        val globPaths: Seq[Path],
+                        val metaClient: HoodieTableMetaClient)
+  extends BaseRelation with TableScan with Logging {

Review comment:
   `PrunedFilteredScan` will change the behavior of `ParquetRecordReader` 
inside `ParquetFileFormat` even when we are not using the vectorized reader. Still 
trying to figure out why... I will follow up on `PrunedFilteredScan` in a 
separate PR.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-20 Thread GitBox


vinothchandar commented on a change in pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#discussion_r457839262



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -245,6 +250,16 @@ public int getMaxConsistencyCheckIntervalMs() {
 return 
Integer.parseInt(props.getProperty(MAX_CONSISTENCY_CHECK_INTERVAL_MS_PROP));
   }
 
+  public BulkInsertSortMode getBulkInsertSortMode() {
+String sortMode = props.getProperty(BULKINSERT_SORT_MODE);
+try {
+  return BulkInsertSortMode.valueOf(sortMode.toUpperCase());
+} catch (IllegalArgumentException e) {

Review comment:
   given IllegalArgumentException is itself a runtime exception.. it may be ok 
to just let that percolate. 
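
   A minimal sketch of the simplification suggested above (a hypothetical shape, 
not the code in this PR): since `valueOf()` already throws `IllegalArgumentException`, 
which is unchecked, the try/catch could be dropped and the exception allowed to propagate.

    // Hypothetical simplified version: BulkInsertSortMode.valueOf() already throws
    // IllegalArgumentException (a RuntimeException), so we simply let it propagate.
    public BulkInsertSortMode getBulkInsertSortMode() {
      String sortMode = props.getProperty(BULKINSERT_SORT_MODE);
      return BulkInsertSortMode.valueOf(sortMode.toUpperCase());
    }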

##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
##
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.execution.LazyInsertIterable.HoodieInsertValueGenResult;
+import org.apache.hudi.io.HoodieWriteHandle;
+import org.apache.hudi.io.WriteHandleFactory;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Consumes stream of hoodie records from in-memory queue and writes to one or 
more create-handles.
+ */
+public class CopyOnWriteInsertHandler
+extends

Review comment:
   let's fix the alignment. maybe here 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +647,10 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  public boolean getStringFormation() {
+return Boolean.parseBoolean(props.getProperty("hoodie.tmp.string.format"));

Review comment:
   what's this? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1756: [HUDI-839] Introducing support for rollbacks using marker files

2020-07-20 Thread GitBox


vinothchandar commented on pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#issuecomment-661633207


   @lw309637554 Looks good. Planning to merge after CI passes this time.. 
Thanks a lot for your contributions. This is one very important PR!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-871) Add support for Tencent cloud COS

2020-07-20 Thread deyzhong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161690#comment-17161690
 ] 

deyzhong commented on HUDI-871:
---

I have solved this problem. Can I submit this PR?

[~xleesf] [~felixzheng]

> Add support for Tencent cloud COS
> -
>
> Key: HUDI-871
> URL: https://issues.apache.org/jira/browse/HUDI-871
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Canbin Zheng
>Priority: Major
>  Labels: newbie, starter
>
> Tencent Cloud COS is becoming a widely used object storage service, and more 
> and more users use COS as their backend storage system; this ticket therefore 
> proposes adding support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #345

2020-07-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.41 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${scala.binary.version}:[unknown-version],
 

[GitHub] [hudi] qingyuan18 opened a new issue #1854: query MOR table using spark sql error

2020-07-20 Thread GitBox


qingyuan18 opened a new issue #1854:
URL: https://github.com/apache/hudi/issues/1854


   versions used: 
   JDK: Jdk 1.8.0_242
   Scala: 2.11.12
   Spark: 2.4.0
   Hudi Spark bundle: 0.5.2-incubating
   
   Steps to reproduce the behavior:
   1. create managed hive table
   2. using Spark datasource to upsert records into it
def upsert(albumDf: DataFrame, tableName: String, key: String, combineKey: String, tablePath: String): Unit = {
  albumDf.write
    .format("hudi")
    .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .mode(SaveMode.Append)
    .save(tablePath)
}
   3. using spark sql to read the result
val spark: SparkSession = SparkSession.builder()
  .appName("hudi-test")
  .master("yarn")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false") // Uses Hive SerDe, this is mandatory for MoR tables
  .getOrCreate()
spark.sql("select  * from  ._acidtest2 ").show()
   
   submit command:  spark-submit --master yarn --conf 
spark.sql.hive.convertMetastoreParquet=false 
HudiTechSpike-jar-with-dependencies.jar
   
   errors:
   java.io.IOException: Not a file: 
hdfs://nameservice1/data/operations/racoe/epi/hive/raw/_acidtest2/default
 at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
 at scala.Option.getOrElse(Option.scala:121)
   It seems like it does not recognize the Hudi data format/path structure.
   
   * Running on Docker? : No
   **Additional context**: using spark-shell gives the same error
   spark-shell --master yarn --conf 
spark.sql.hive.convertMetastoreParquet=false --jars 
hudi-spark-bundle_2.11-0.5.3.jar
   
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-20 Thread GitBox


zherenyu831 commented on pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#issuecomment-661596798


   @leesf 
   Fixed, please check



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] henrywu2019 commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-20 Thread GitBox


henrywu2019 commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r457798426



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;

Review comment:
   Oh... What I meant is the name `hadoopConf` at line 32, not the class 
name, which implies `hadoop`. I bumped into this while searching for Flink support 
in Hudi, and this PR looks like a big step in that direction. Thanks tons 
@Mathieu1124 and definitely @vinothchandar as well.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #1849: [WIP] Externalize test classes' configs

2020-07-20 Thread GitBox


xushiyan commented on a change in pull request #1849:
URL: https://github.com/apache/hudi/pull/1849#discussion_r457785815



##
File path: 
hudi-client/src/test/resources/org/apache/hudi/index/hbase/TestHBaseIndex.properties
##
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# HoodieWriteConfig

Review comment:
   I agree that it's handy with the builder pattern. I came at this from the point 
where most configs across test cases are static and duplicated; for each test 
class, only 1 or 2 properties vary. That's what the minimal exposure via 
overwritingProps aims for. But I agree the benefits are less obvious. Closing 
this one first. Thanks for the input.
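
   As a rough illustration of the trade-off discussed above, here is a minimal 
sketch of the "overwritingProps" idea, assuming a hypothetical helper (not the 
actual test utility in this PR): shared static defaults plus a small per-test-class 
override file loaded from the classpath, e.g. TestHBaseIndex.properties.

    import java.io.InputStream;
    import java.util.Properties;

    public class TestConfigLoader {
      // Load shared defaults, then overlay the per-test-class overrides if present.
      public static Properties load(Class<?> testClass) throws Exception {
        Properties defaults = new Properties();
        defaults.setProperty("hoodie.insert.shuffle.parallelism", "2"); // static, duplicated across tests

        Properties merged = new Properties(defaults);
        // Resolves relative to the test class's package, e.g.
        // org/apache/hudi/index/hbase/TestHBaseIndex.properties
        try (InputStream in = testClass.getResourceAsStream(testClass.getSimpleName() + ".properties")) {
          if (in != null) {
            merged.load(in); // the one or two properties that actually vary per test class
          }
        }
        return merged;
      }
    }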





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan closed pull request #1849: [WIP] Externalize test classes' configs

2020-07-20 Thread GitBox


xushiyan closed pull request #1849:
URL: https://github.com/apache/hudi/pull/1849


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457778836



##
File path: 
hudi-client/src/main/java/org/apache/hudi/callback/impl/HoodieHttpWriteCommitCallback.java
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.callback.impl;
+
+import org.apache.hudi.callback.HoodieWriteCommitCallback;
+import org.apache.hudi.callback.client.http.HoodieWriteCallbackHttpClient;
+import org.apache.hudi.callback.common.HoodieBaseCommitCallbackMessage;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitCallbackException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.codehaus.jackson.map.ObjectMapper;
+
+import java.io.IOException;
+
+/**
+ * A http implementation of {@link HoodieWriteCommitCallback}.
+ */
+public class HoodieHttpWriteCommitCallback implements 
HoodieWriteCommitCallback {

Review comment:
   > Is `HoodieWriteCommitHttpCallback` more reasonable?
   
   Yes, looks better





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


yanghua commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457769421



##
File path: 
hudi-client/src/main/java/org/apache/hudi/callback/HoodieWriteCommitCallback.java
##
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.callback;
+
+/**
+ * A callback interface help to call back when a write commit completes 
successfully.
+ */
+public interface HoodieWriteCommitCallback {
+
+  /**
+   * A callback method the user can implement to provide asynchronous handling 
of successful write.
+   * This method will be called when a write operation is committed 
successfully.
+   *
+   * @param commitTime commitTime which is successfully committed
+   */
+  void call(String commitTime);

Review comment:
   Can we provide both commit instant and table name?
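
   A minimal sketch of the signature change being suggested here (an illustration 
of the review suggestion, not the merged API):

    public interface HoodieWriteCommitCallback {

      /**
       * Called when a write operation is committed successfully.
       *
       * @param commitTime the instant that was committed successfully
       * @param tableName  the name of the table the commit belongs to
       */
      void call(String commitTime, String tableName);
    }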

##
File path: 
hudi-client/src/main/java/org/apache/hudi/callback/common/HoodieBaseCommitCallbackMessage.java
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.callback.common;
+
+import java.io.Serializable;
+
+/**
+ * Base callback message, which contains commitTime and tableName only for now.
+ */
+public class HoodieBaseCommitCallbackMessage implements Serializable {

Review comment:
   If we provide the table name as a parameter in 
`HoodieWriteCommitCallback#call(...)`, then this class is not necessary, IMO.

##
File path: 
hudi-client/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCallbackHttpClient.java
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.callback.client.http;
+
+import org.apache.http.HttpHeaders;
+import org.apache.http.client.config.RequestConfig;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpPost;
+import org.apache.http.entity.ContentType;
+import org.apache.http.entity.StringEntity;
+import org.apache.http.impl.client.CloseableHttpClient;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.hudi.config.HoodieWriteCommitCallbackConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Write commit callback http client.
+ */
+public class HoodieWriteCallbackHttpClient implements Closeable {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieWriteCallbackHttpClient.class);
+
+  public static final String HEADER_KEY_API_KEY = "HUDI-CALLBACK-KEY";
+
+  private 

[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-20 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-661504763


   @satishkotha @nbalajee @prashantwason @modi95  please take a look as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-845:

Status: Open  (was: New)

> Allow parallel writing and move the pending rollback work into cleaner
> --
>
> Key: HUDI-845
> URL: https://issues.apache.org/jira/browse/HUDI-845
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Things to think about 
>  * Commit time has to be unique across writers 
>  * Parallel writers can finish commits out of order, i.e. c2 commits before c1.
>  * MOR log blocks fence uncommitted data. 
>  * Cleaner should loudly complain if it cannot finish cleaning up partial 
> writes.  
>  
> P.S: think about what is left for the general case: log files may have 
> different order, inserts may violate the uniqueness constraint



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1098:
-
Status: In Progress  (was: Open)

> Marker file finalizing may block on a data file that was never written
> --
>
> Key: HUDI-1098
> URL: https://issues.apache.org/jira/browse/HUDI-1098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>
> {code:java}
> // Ensure all files in the delete list are actually present. This is mandatory for
> // an eventually consistent FS. Otherwise, we may miss deleting such files.
> // If files are not found even after retries, fail the commit.
> if (consistencyCheckEnabled) {
>   // This will ensure all files to be deleted are present.
>   waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
> }
> {code}
> We need to handle the case where the marker file was created, but we crashed 
> before the data file was created. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1098:
-
Status: Open  (was: New)

> Marker file finalizing may block on a data file that was never written
> --
>
> Key: HUDI-1098
> URL: https://issues.apache.org/jira/browse/HUDI-1098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>
> {code:java}
> // Ensure all files in the delete list are actually present. This is mandatory for
> // an eventually consistent FS. Otherwise, we may miss deleting such files.
> // If files are not found even after retries, fail the commit.
> if (consistencyCheckEnabled) {
>   // This will ensure all files to be deleted are present.
>   waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
> }
> {code}
> We need to handle the case where the marker file was created, but we crashed 
> before the data file was created. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-845:
---

Assignee: Vinoth Chandar

> Allow parallel writing and move the pending rollback work into cleaner
> --
>
> Key: HUDI-845
> URL: https://issues.apache.org/jira/browse/HUDI-845
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Things to think about 
>  * Commit time has to be unique across writers 
>  * Parallel writers can finish commits out of order, i.e. c2 commits before c1.
>  * MOR log blocks fence uncommitted data. 
>  * Cleaner should loudly complain if it cannot finish cleaning up partial 
> writes.  
>  
> P.S: think about what is left for the general case: log files may have 
> different order, inserts may violate the uniqueness constraint



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1014) Design and Implement upgrade-downgrade infrastructure

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1014:
-
Status: In Progress  (was: Open)

> Design and Implement upgrade-downgrade infrastructure
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1014) Design and Implement upgrade-downgrade infrastructure

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1014:
-
Status: Open  (was: New)

> Design and Implement upgrade-downgrade infrastructure
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1049) In inline compaction mode, previously failed compactions need to be retried before new compactions

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1049:
-
Status: Patch Available  (was: In Progress)

> In inline compaction mode, previously failed compactions need to be retried 
> before new compactions 
> 
>
> Key: HUDI-1049
> URL: https://issues.apache.org/jira/browse/HUDI-1049
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> With async compaction, previously failed compactions are retried before new 
> compactions are run. With inline compaction, this retry on failure is not 
> happening.
>  
> As async compaction is the de facto mode for MOR tables, we haven't noticed 
> this problem in the community. But it was reported recently as part of 
> [https://github.com/apache/hudi/issues/1764#issuecomment-648882567]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1049) In inline compaction mode, previously failed compactions need to be retried before new compactions

2020-07-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161625#comment-17161625
 ] 

Vinoth Chandar commented on HUDI-1049:
--

Need to add a test and retarget for master/0.6.0

> In inline compaction mode, previously failed compactions need to be retried 
> before new compactions 
> 
>
> Key: HUDI-1049
> URL: https://issues.apache.org/jira/browse/HUDI-1049
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> With async compaction, previously failed compactions are retried before new 
> compactions are run. With inline compaction, this retry on failure is not 
> happening.
>  
> As async compaction is the de facto mode for MOR tables, we haven't noticed 
> this problem in the community. But it was reported recently as part of 
> [https://github.com/apache/hudi/issues/1764#issuecomment-648882567]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1013:
-
Status: Patch Available  (was: In Progress)

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Our bulk insert (not just bulk insert, all operations in fact) does a dataset-to-RDD 
> conversion in HoodieSparkSqlWriter, and our HoodieClient deals with 
> JavaRDDs. We are trying to see if we can improve our 
> performance by avoiding the RDD conversion. We will first start off w/ bulk 
> insert and get it working end to end before we decide if we want to do this for 
> other operations too, after doing some perf analysis. 
>  
> On a high level, this is the idea:
> 1. The Dataset will be passed in all the way from the spark sql writer to the 
> storage writer. We do not convert to HoodieRecord at any point in time. 
> 2. We need to use 
> [ParquetWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala]
>  to write to Parquet as InternalRows.
> 3. The gist of what we want to do is: with the Datasets, sort by 
> partition path and record keys, repartition by the parallelism config, and do 
> mapPartitions. Within mapPartitions, we will iterate through the Rows, encode 
> them to InternalRows and write to Parquet using the write support linked above. 
> We first wanted to check whether our strategy will actually improve the perf. So, 
> I did a quick hack of just the mapPartitions func in HoodieSparkSqlWriter just 
> to see what the numbers look like. Check for operation 
> "bulk_insert_direct_parquet_write_support" 
> [here|#diff-5317f4121df875e406876f9f0f012fac]. 
> These are the numbers I got. (1) is the existing hoodie bulk insert, which does 
> the rdd conversion to JavaRDD. (2) is writing directly to 
> parquet in spark. Code given below. (3) is the modified hudi code (i.e. 
> operation bulk_insert_direct_parquet_write_support).
>  
> | |5M records 100 parallelism input size 2.5 GB|
> |(1) Orig hoodie(unmodified)|169 secs. output size 2.7 GB|
> |(2) Parquet |62 secs. output size 2.5 GB|
> |(3) Modified hudi code. Direct Parquet Write |73 secs. output size 2.5 GB|
>  
> So, essentially our existing code for bulk insert is > 2x that of parquet. 
> Our modified hudi code (i.e. operation 
> bulk_insert_direct_parquet_write_support) is close to direct Parquet write in 
> spark, which shows that our strategy should work. 
> // This is the Parquet write in spark. (2) above. 
> transformedDF.sort("partition", "key")
>   .coalesce(parallelism)
>   .write.format("parquet")
>   .partitionBy("partition")
>   .mode(saveMode)
>   .save(s"$outputPath/$format")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-651:

Status: In Progress  (was: Open)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them like with Hive QL.. Something is amiss
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> 

[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-651:

Status: Patch Available  (was: In Progress)

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them like with Hive QL.. Something is amiss
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> 

[jira] [Updated] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-472:

Status: Patch Available  (was: In Progress)

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-305:

Status: Patch Available  (was: In Progress)

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so it is not reading 
> the log file and is only reading the base parquet file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1015:


Assignee: Balaji Varadarajan  (was: Vinoth Chandar)

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-07-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-575:
---

Assignee: Balaji Varadarajan  (was: Vinoth Chandar)

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for Structured Streaming 
> writes. 
>  
> We need to (a sketch of the user-facing side follows below):
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer
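For readers skimming the archive, a minimal, hedged sketch of how the user-facing side of this could look once it lands: a Spark Structured Streaming write into a MOR table with asynchronous compaction switched on. The option key `hoodie.datasource.compaction.async.enable` is an assumption for illustration (the real switch is whatever this ticket introduces); `inputDF`, the table name, field names and paths are placeholders.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.streaming.Trigger

// Sketch only: assumes inputDF is a streaming DataFrame carrying event_id/event_ts columns.
val query = inputDF.writeStream
  .format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hudi_events_mor_1")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option("hoodie.datasource.compaction.async.enable", "true") // hypothetical async-compaction switch
  .option("checkpointLocation", "/tmp/hudi_streaming_checkpoint")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .outputMode("append")
  .start("/tmp/hudi/hudi_events_mor_1")
```

The idea being that compaction would then be scheduled inline but executed by a separate process or thread, mirroring the DeltaStreamer continuous mode mentioned in the second bullet.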



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-661461345


   @zuyanton : I am not sure if I can find the source code of this class. 
@umehrot2 : Can you let me know if the current implementation of FileStatus 
returned by S3NativeFileSystem overrides getLen()?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1846: [SUPPORT] HoodieSnapshotCopier example

2020-07-20 Thread GitBox


bvaradar commented on issue #1846:
URL: https://github.com/apache/hudi/issues/1846#issuecomment-661457074


   @xushiyan : As you are familiar with this part, would you be able to help 
answer this question ? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1115) Setup and run long running streaming job in AWS environment

2020-07-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1115:
-
Status: In Progress  (was: Open)

> Setup and run long running streaming job in AWS environment
> ---
>
> Key: HUDI-1115
> URL: https://issues.apache.org/jira/browse/HUDI-1115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Setup to run 
> [https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271] in AWS at 
> a larger scale. This will be useful for us to vet releases and to debug 
> issues reproducible only in AWS environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1115) Setup and run long running streaming job in AWS environment

2020-07-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1115:
-
Status: Open  (was: New)

> Setup and run long running streaming job in AWS environment
> ---
>
> Key: HUDI-1115
> URL: https://issues.apache.org/jira/browse/HUDI-1115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Setup to run 
> [https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271] in AWS at 
> a larger scale. This will be useful for us to vet releases and to debug 
> issues reproducible only in AWS environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1115) Setup and run long running streaming job in AWS environment

2020-07-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1115:


Assignee: Balaji Varadarajan

> Setup and run long running streaming job in AWS environment
> ---
>
> Key: HUDI-1115
> URL: https://issues.apache.org/jira/browse/HUDI-1115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Setup to run 
> [https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271] in AWS at 
> a larger scale. This will be useful for us to vet releases and to debug 
> issues reproducible only in AWS environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1115) Setup and run long running streaming job in AWS environment

2020-07-20 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1115:


 Summary: Setup and run long running streaming job in AWS 
environment
 Key: HUDI-1115
 URL: https://issues.apache.org/jira/browse/HUDI-1115
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Testing
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


Setup to run 
[https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271] in AWS at a 
larger scale. This will be useful for us to vet releases and to debug issues 
reproducible only in AWS environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-20 Thread GitBox


nsivabalan commented on pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#issuecomment-661451741


   sure, thanks. Once done, do ping me and vinoth for review. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-20 Thread GitBox


yihua commented on pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#issuecomment-661421240


   @nsivabalan Thanks for the fix.  There is some ad-hoc code in this PR just 
for testing.  Let me clean that up.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1072) Reader changes to support clustering and insert overwrite

2020-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1072:
-
Labels: pull-request-available  (was: )

> Reader changes to support clustering and insert overwrite
> -
>
> Key: HUDI-1072
> URL: https://issues.apache.org/jira/browse/HUDI-1072
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>
> * Add metadata to track ‘replaced’ files. Replaced files are essentially file 
> groups to be ignored. For ‘insert overwrite’ this is all existing files in 
> the partition overwritten. For ‘clustering’, this is all file groups that are 
> merged into a new set of file groups.
> * Change Views to ignore replaced files (AbstractTableFileSystemView and all 
> subclasses)
> * Change cleaner to delete data files that have been replaced (Introduce a 
> new policy?)
> * Change archival to not delete active commits that have this special 
> metadata if corresponding data files are not deleted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha opened a new pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-20 Thread GitBox


satishkotha opened a new pull request #1853:
URL: https://github.com/apache/hudi/pull/1853


   ## What is the purpose of the pull request
   
   This is part of work required for RFC-18 and RFC-19.  Add replace action to 
valid actions in the timeline. 
   
   To keep the diff small and get feedback, I am sending just the structure of the 
metadata. For examples of how this will be used, see the POC here 
https://github.com/satishkotha/incubator-hudi/commit/45d8c26b407b2b5329925cb0ab5af93e293a1cae.
   
   I'm happy to bring in other changes from the POC if you think we can push a larger 
change.
   
   ## Brief change log
   * Add replace action to valid actions in the timeline
   * Add replace metadata file format
   
   ## Verify this pull request
   
   This change added unit tests
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch hudi_test_suite_refactor updated (5cdfbe0 -> 8980e09)

2020-07-20 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 5cdfbe0  [HUDI-394] Provide a basic implementation of test suite
 add 8980e09  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (5cdfbe0)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (8980e09)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../hudi/integ/testsuite/helpers/HiveServiceProvider.java  | 14 --
 pom.xml|  2 +-
 2 files changed, 13 insertions(+), 3 deletions(-)



[GitHub] [hudi] vinothchandar commented on a change in pull request #1849: [WIP] Externalize test classes' configs

2020-07-20 Thread GitBox


vinothchandar commented on a change in pull request #1849:
URL: https://github.com/apache/hudi/pull/1849#discussion_r457717555



##
File path: 
hudi-client/src/test/resources/org/apache/hudi/index/hbase/TestHBaseIndex.properties
##
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# HoodieWriteConfig

Review comment:
   what's the value of pulling this into config instead of code? IMO, code is 
much easier to debug.. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zuyanton edited a comment on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


zuyanton edited a comment on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-661287128


   @bvaradar , logging  ```fileStatus.getClass().getName()``` from within 
```HoodieBaseFile``` constructor, gives me 
```com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$3```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1756: [HUDI-839] Introducing support for rollbacks using marker files

2020-07-20 Thread GitBox


vinothchandar commented on a change in pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#discussion_r457651663



##
File path: hudi-client/src/main/java/org/apache/hudi/table/MarkerFiles.java
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.IOType;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.LinkedList;
+import java.util.List;
+
+/**
+ * Operates on marker files for a given write action (commit, delta commit, 
compaction).
+ */
+public class MarkerFiles {
+
+  private static final Logger LOG = LogManager.getLogger(MarkerFiles.class);
+
+  public static String stripMarkerSuffix(String path) {
+return path.substring(0, path.indexOf(HoodieTableMetaClient.MARKER_EXTN));
+  }
+
+  private final String instantTime;
+  private final FileSystem fs;
+  private final Path markerDirPath;
+  private final String basePath;
+
+  public MarkerFiles(FileSystem fs, String basePath, String markerFolderPath, 
String instantTime) {
+this.instantTime = instantTime;
+this.fs = fs;
+this.markerDirPath = new Path(markerFolderPath);
+this.basePath = basePath;
+  }
+
+  public MarkerFiles(HoodieTable table, String instantTime) {
+this(table.getMetaClient().getFs(),
+table.getMetaClient().getBasePath(),
+table.getMetaClient().getMarkerFolderPath(instantTime),
+instantTime);
+  }
+
+  public void quietDeleteMarkerDir() {
+try {
+  deleteMarkerDir();
+} catch (HoodieIOException ioe) {
+  LOG.warn("Error deleting marker directory for instant " + instantTime, 
ioe);
+}
+  }
+
+  /**
+   * Delete Marker directory corresponding to an instant.
+   */
+  public boolean deleteMarkerDir() {
+try {
+  boolean result = fs.delete(markerDirPath, true);
+  if (result) {
+LOG.info("Removing marker directory at " + markerDirPath);
+  } else {
+LOG.info("No marker directory to delete at " + markerDirPath);
+  }
+  return result;
+} catch (IOException ioe) {
+  throw new HoodieIOException(ioe.getMessage(), ioe);
+}
+  }
+
+  public boolean doesMarkerDirExist() throws IOException {
+return fs.exists(markerDirPath);
+  }
+
+  public List<String> createdAndMergedDataPaths() throws IOException {
+List<String> dataFiles = new LinkedList<>();
+FSUtils.processFiles(fs, markerDirPath.toString(), (status) -> {
+  String pathStr = status.getPath().toString();
+  if (pathStr.contains(HoodieTableMetaClient.MARKER_EXTN) && 
!pathStr.endsWith(IOType.APPEND.name())) {

Review comment:
   yes. we will be performing an upgrade to 0.6.0 anyway, which will list 
the inflight instants at the time of upgrade and then subsequently write the 
compatible, corresponding marker files 
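As a reading aid, a hedged sketch (in Scala, not the actual rollback executor) of how the MarkerFiles API above can be driven, assuming `table` is a HoodieTable and `instantTime` is the inflight instant being rolled back:

```scala
import scala.collection.JavaConverters._

// Sketch only: exercises the API shown in this file, not the real rollback code path.
val markers = new MarkerFiles(table, instantTime)
if (markers.doesMarkerDirExist()) {
  // Base files created or merged by the failed write; APPEND markers are handled separately.
  val dataPaths = markers.createdAndMergedDataPaths().asScala
  dataPaths.foreach(path => println(s"would roll back partially written file: $path"))
  // Finally drop the marker directory itself (the quiet variant only logs IO errors).
  markers.quietDeleteMarkerDir()
}
```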





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zuyanton commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


zuyanton commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-661287128


   @bvaradar , logging  ```fileStatus.getClass().getName()``` from within 
```HoodieBaseFile``` constructor, gives me 
```com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch hudi_test_suite_refactor updated (82f06f3 -> 5cdfbe0)

2020-07-20 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 82f06f3  [HUDI-394] Provide a basic implementation of test suite
 add 5cdfbe0  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (82f06f3)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (5cdfbe0)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../testsuite/TestDFSHoodieTestSuiteWriterAdapter.java | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)



[jira] [Updated] (HUDI-1114) Explore Spark Structure Streaming for Hudi Dataset

2020-07-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1114:
-
Status: Open  (was: New)

> Explore Spark Structure Streaming for Hudi Dataset
> --
>
> Key: HUDI-1114
> URL: https://issues.apache.org/jira/browse/HUDI-1114
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> [https://github.com/apache/hudi/issues/1839]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1114) Explore Spark Structure Streaming for Hudi Dataset

2020-07-20 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1114:


 Summary: Explore Spark Structure Streaming for Hudi Dataset
 Key: HUDI-1114
 URL: https://issues.apache.org/jira/browse/HUDI-1114
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Yanjia Gary Li


[https://github.com/apache/hudi/issues/1839]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] garyli1019 commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

2020-07-20 Thread GitBox


garyli1019 commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661272780


   This is an interesting feature. I created a ticket to track this. 
https://issues.apache.org/jira/browse/HUDI-1114.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

2020-07-20 Thread GitBox


rubenssoto commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661235040


   Hi Vinoth, thank you for your answer.
   
   I will watch your video; the incremental query will probably help me for now, but 
we want to use Spark Structured Streaming as the default for all our datasets, since 
Spark Streaming takes care of checkpointing and things like that.
   
   If you could add Spark Structured Streaming integration in a future 
version, that would be great.
   
   Thank you! :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #1839: Question, Add Support to Hudi datasets to spark structured streaming

2020-07-20 Thread GitBox


vinothchandar commented on issue #1839:
URL: https://github.com/apache/hudi/issues/1839#issuecomment-661204597


   @rubenssoto yes. we already support incremental queries using the spark 
datasource. It seems like the only thing missing here is that you want the 
spark structured streaming integration? (which we can add after 0.6.0)
   https://hudi.apache.org/docs/querying_data.html#spark-incr-query
   
   https://www.youtube.com/watch?v=1w3IpavhSWA actually talks about a 
production use-case we build using an incremental query + some grouping on the 
sink side. Unlike delta, Hudi actually has record level metadata around arrival 
times and thus does not need anything like ignoreChanges. 
   
   I am not sure if I am missing something around your use-case, but it feels like 
you should be able to get this working incrementally end-to-end with what we have 
today (again, we can add spark streaming read support.. if there are hands to 
help.. cc @garyli1019? :)) 
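For anyone landing on this thread, a minimal sketch of the incremental read path referenced above, assuming a recent release where these DataSourceReadOptions constants exist; the begin instant time and table path are placeholders:

```scala
import org.apache.hudi.DataSourceReadOptions

// Pull only records committed after the given instant, as a regular batch DataFrame.
val incrementalDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20200302210010") // placeholder begin instant
  .load("/path/to/hudi/table")                                               // placeholder base path

incrementalDF.select("_hoodie_commit_time").distinct().show(false)
```

This is a pull-based batch query; the Structured Streaming source discussed here would wrap the same incremental semantics behind a streaming interface.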
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1787: Exception During Insert

2020-07-20 Thread GitBox


asheeshgarg commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-661196097


   @bvaradar I am running  hudi-spark-bundle



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-20 Thread GitBox


asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661195131


   @bvaradar  thanks Balaji for your continuous support; will test this. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch hudi_test_suite_refactor updated (13e3d70 -> 82f06f3)

2020-07-20 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 13e3d70  [HUDI-394] Provide a basic implementation of test suite
 add 82f06f3  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (13e3d70)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (82f06f3)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java | 5 +++--
 .../apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java| 5 +++--
 .../integ/testsuite/reader/TestDFSHoodieDatasetInputReader.java  | 9 -
 .../org/apache/hudi/utilities/testutils/UtilitiesTestBase.java   | 7 +++
 4 files changed, 13 insertions(+), 13 deletions(-)



[GitHub] [hudi] nsivabalan commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-20 Thread GitBox


nsivabalan commented on a change in pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#discussion_r457479621



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -337,9 +337,15 @@ private void refreshTimeline() throws IOException {
 }
 
JavaRDD<GenericRecord> avroRDD = avroRDDOptional.get();
+if (writeClient == null) {
+  this.schemaProvider = schemaProvider;
+  setupWriteClient();

Review comment:
   also thinking: do we really need to instantiate the config? since it is 
just one property, can't we directly read it from TypedProperties? @bvaradar : 
do you have any thoughts on this? Basically we need to read just one config 
value for deleteMarker from the properties set. This step is a little ahead of 
where we instantiate writeClient, so wondering how to go about it. 
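A tiny hedged sketch of the alternative being floated here, assuming `props` is the TypedProperties instance DeltaSync already holds; the property name below is deliberately hypothetical (the real key is whatever this PR defines):

```scala
// Hypothetical sketch: the key name is an assumption for illustration only.
val deleteMarkerField = props.getString("hoodie.write.delete.marker.field", "_hoodie_is_deleted")
```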





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-20 Thread GitBox


nsivabalan commented on a change in pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#discussion_r457472660



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -337,9 +337,15 @@ private void refreshTimeline() throws IOException {
 }
 
JavaRDD<GenericRecord> avroRDD = avroRDDOptional.get();
+if (writeClient == null) {
+  this.schemaProvider = schemaProvider;
+  setupWriteClient();

Review comment:
   all we need is a config here. don't think we need to initialize 
writeClient here.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
##
@@ -66,8 +74,9 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
 }
 
 GenericRecord genericRecord = (GenericRecord) recordOption.get();
-// combining strategy here trivially ignores currentValue on disk and 
writes this record
-Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
+// combining strategy here trivially ignores currentValue on disk and 
writes this record
+String deleteField = isDeletedField == null ? "_hoodie_is_deleted" : 
isDeletedField;

Review comment:
   sorry, I didn't realize there was another constructor. We could then initialize 
isDeletedField = "_hoodie_is_deleted" so that one of the constructors will 
override the value. 
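For readers following this thread, a small sketch of how the current, hard-coded `_hoodie_is_deleted` marker is typically used: rows upserted with that boolean column set to true are treated as deletes by OverwriteWithLatestAvroPayload. `existingDF`, the predicate, field names and path below are placeholders.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.functions.{col, lit}

// Sketch only: mark the rows to delete and upsert them back with the delete flag set.
val deletesDF = existingDF
  .filter(col("event_id") === "104")               // placeholder predicate
  .withColumn("_hoodie_is_deleted", lit(true))

deletesDF.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hudi_events_mor_1")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .mode("append")
  .save("/path/to/hudi/table")
```

Making the field name configurable, as this PR proposes, would let payloads look at a user-chosen column instead of `_hoodie_is_deleted`.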





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457470710



##
File path: 
hudi-client/src/main/java/org/apache/hudi/exception/HoodieCommitCallbackException.java
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.exception;
+
+import org.apache.hudi.callback.HoodieWriteCommitCallback;
+
+/**
+ * Exception thrown for any higher level errors when {@link 
HoodieWriteCommitCallback} is executing a callback.
+ */
+public class HoodieCommitCallbackException extends HoodieException {
+
+  public HoodieCommitCallbackException(String msg, Throwable e) {
+super(msg, e);
+  }
+
+  public HoodieCommitCallbackException(String msg) {
+super(msg);
+  }
+

Review comment:
   > extra line
   
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457470368



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.config;
+
+import org.apache.hudi.callback.common.HoodieCommitCallbackType;
+import org.apache.hudi.common.config.DefaultHoodieConfig;
+
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Write callback related config.
+ */
+public class HoodieWriteCommitCallbackConfig extends DefaultHoodieConfig {
+
+  public static final String CALLBACK_ON = "hoodie.write.commit.callback.on";
+  public static final boolean DEFAULT_CALLBACK_ON = false;
+
+  public static final String CALLBACK_TYPE_PROP = 
"hoodie.write.commit.callback.type";
+  public static final String DEFAULT_CALLBACK_TYPE_PROP = 
HoodieCommitCallbackType.HTTP.name();
+
+  public static final String CALLBACK_CLASS_PROP = 
"hoodie.write.commit.callback.class";
+  public static final String DEFAULT_CALLBACK_CLASS_PROP = "";
+
+  // * REST callback configs *
+  public static final String CALLBACK_HTTP_URL_PROP = 
"hoodie.write.commit.callback.http.url";
+  public static final String CALLBACK_API_KEY = 
"hoodie.write.commit.callback.http.api.key";
+  public static final String DEFAULT_CALLBACK_API_KEY = 
"hudi_write_commit_callback";
+  public static final String CALLBACK_TIMEOUT_SECONDS = 
"hoodie.write.commit.callback.rest.timeout.seconds";
+  public static final String DEFAULT_CALLBACK_TIMEOUT_SECONDS = "3";
+
+  private HoodieWriteCommitCallbackConfig(Properties props) {
+super(props);
+  }
+
+  public static HoodieWriteCommitCallbackConfig.Builder newBuilder() {
+return new HoodieWriteCommitCallbackConfig.Builder();
+  }
+
+  public static class Builder {
+
+private final Properties props = new Properties();
+
+public HoodieWriteCommitCallbackConfig.Builder fromFile(File 
propertiesFile) throws IOException {
+  try (FileReader reader = new FileReader(propertiesFile)) {
+this.props.load(reader);
+return this;
+  }
+}
+
+public HoodieWriteCommitCallbackConfig.Builder fromProperties(Properties 
props) {
+  this.props.putAll(props);
+  return this;
+}
+
+public HoodieWriteCommitCallbackConfig.Builder 
writeCommitCallbackOn(String callbackOn) {
+  props.setProperty(CALLBACK_ON, String.valueOf(callbackOn));

Review comment:
   > no need to use `String.valueOf` here
   
   done
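For context on how this surfaces to users once the PR lands, a hedged sketch of a write that turns the HTTP callback on. The key names are the ones defined in HoodieWriteCommitCallbackConfig above; the URL, API key, timeout and table path are placeholders, and the usual record key / precombine / table options are omitted for brevity.

```scala
// Sketch only: assumes df is the DataFrame being written to an existing Hudi table.
df.write.format("org.apache.hudi")
  .option("hoodie.write.commit.callback.on", "true")
  .option("hoodie.write.commit.callback.type", "HTTP")
  .option("hoodie.write.commit.callback.http.url", "https://example.com/hudi/commit-events")
  .option("hoodie.write.commit.callback.http.api.key", "my_api_key")
  .option("hoodie.write.commit.callback.rest.timeout.seconds", "3")
  .mode("append")
  .save("/path/to/hudi/table")
```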





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457470536



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +632,33 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  /**
+   * Commit call back configs.
+   */
+  public boolean enableWriteCommitCallback() {
+return 
Boolean.parseBoolean(props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_ON));
+  }
+
+  public String getCallbackType() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_TYPE_PROP);
+  }
+
+  public String getCallbackClass() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_CLASS_PROP);
+  }
+
+  public String getCallbackRestUrl() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_URL_PROP);
+  }
+
+  public int getCallbackRestTimeout() {
+return 
Integer.parseInt(props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_TIMEOUT_SECONDS));
+  }
+
+  public String getCallbackRestApiKey() {
+return props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_API_KEY);
+  }

Review comment:
   > should we move these methods into 
`HoodieHttpWriteCommitCallback.java`? I am worried about the explosion of other 
callback methods in `HoodieWriteConfig`
   
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457470209



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +632,33 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  /**
+   * Commit call back configs.
+   */
+  public boolean enableWriteCommitCallback() {

Review comment:
   > keep consistent with callback_on above, so `writeCommitCallbackOn`?
   
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tooptoop4 edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

2020-07-20 Thread GitBox


tooptoop4 edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-660715533


   @bvaradar I noticed a "There is insufficient memory for the Java Runtime 
Environment to continue." error, so I reduced SPARK_WORKER_MEMORY (i.e. leaving more 
room for OS memory). Now the timings I get are: 43 mins for 
hoodie.bloom.index.bucketized.checking = false and 59 mins for 
hoodie.bloom.index.bucketized.checking = true.
   
   **hoodie.bloom.index.bucketized.checking = false**
   
   
![image](https://user-images.githubusercontent.com/33283496/87885750-b4829b80-ca0f-11ea-99d9-195b3a6cc562.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885776-dda32c00-ca0f-11ea-9f8e-e9c15ead96c2.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885794-fad7fa80-ca0f-11ea-8d16-b5a290676525.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885812-1e9b4080-ca10-11ea-9ac7-e3a487f4a8b7.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885847-5dc99180-ca10-11ea-9a13-fbef57f240b3.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885876-91a4b700-ca10-11ea-906b-563cd0d25d55.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885894-bb5dde00-ca10-11ea-977a-681a3c7b4d1c.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885907-d6305280-ca10-11ea-8f2d-aeec67b1916b.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885922-f19b5d80-ca10-11ea-8359-5fc0adecb8cb.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885930-07a91e00-ca11-11ea-84f5-379f1953ad67.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885947-1e4f7500-ca11-11ea-81cb-977a289eba53.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885961-4343e800-ca11-11ea-9f7d-bea8d5a47012.png)
   
![image](https://user-images.githubusercontent.com/33283496/87885972-5eaef300-ca11-11ea-82a2-3dcc70474d5c.png)
   
   
   
   
   **hoodie.bloom.index.bucketized.checking = true**
   
   
   
   
![image](https://user-images.githubusercontent.com/33283496/87886008-a03f9e00-ca11-11ea-9a23-acccedbcae29.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886021-bd746c80-ca11-11ea-986f-ce83b8430869.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886046-e85ec080-ca11-11ea-99d0-52fe4d7bdc2d.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886069-09271600-ca12-11ea-8bab-e06ccb503e80.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886091-2bb92f00-ca12-11ea-9d00-561ef63bcabf.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886110-4be8ee00-ca12-11ea-9eb3-d17de793bb9b.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886117-63c07200-ca12-11ea-97fa-7655500c3848.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886131-79ce3280-ca12-11ea-898b-bbaca156fd91.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886152-95393d80-ca12-11ea-8b90-c6f6c52bff94.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886164-ac782b00-ca12-11ea-8231-e147ad4376b5.png)
   
![image](https://user-images.githubusercontent.com/33283496/87886171-bc900a80-ca12-11ea-9ef6-a7b680d2943a.png)
   
   
   I wonder if https://issues.apache.org/jira/browse/SPARK-27734 is causing the 
memory issues



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssomuah opened a new issue #1852: [SUPPORT]

2020-07-20 Thread GitBox


ssomuah opened a new issue #1852:
URL: https://github.com/apache/hudi/issues/1852


   **Describe the problem you faced**
   
   Write performance degrades over time 
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create an unpartitioned MOR table
   2. Use it for a few days
   
   **Expected behavior**
   
   Write performance should not degrade over time 
   
   **Environment Description**
   
   * Hudi version :  Master @ 3b9a305 
https://github.com/apache/hudi/tree/3b9a30528bd6a6369181702303f3384162b04a7f
   
   * Spark version : 2.4.4
   
   * Hive version : N/A
   
   * Hadoop version : 2.7.3
   
   * Storage (HDFS/S3/GCS..) : ABFSS
   
   * Running on Docker? (yes/no) : no 
   
   
   **Additional context**
   
   The MOR table has a single partition. 
   It's a spark streaming application with 5 minute batches. 
   Initially it runs and completes batches within the duration, but over time the 
time for batches to complete increases. 
   From the spark ui we can see that most of the time is being taken actually 
writing the files. 
   
   ![image](https://user-images.githubusercontent.com/2061955/87941642-7023e980-ca69-11ea-9f7a-9d801a9be131.png)
   
   
   And looking at the thread dump of the executors they are almost always 
spending their time listing files. 
   
   I think the reason for this is we have an extremely high number of files in 
the single partition folder. 
   
   An ls on the folder is showing about 45,000 files. 
   
   The other odd thing is that when we look at the write tasks in the Spark UI, there 
are several tasks that seem to have tiny numbers of records in them. 
   
   ![image](https://user-images.githubusercontent.com/2061955/87941812-b11bfe00-ca69-11ea-9e77-32cf83f6e2e1.png)
   
   
   
   We can see compaction taking place so it's not clear why we still have so 
many files. 
   ![image](https://user-images.githubusercontent.com/2061955/87941819-b4af8500-ca69-11ea-8f0a-9c23df08052e.png)
   
   
   The table config is 
   
 .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
 .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
HoodieTableType.MERGE_ON_READ.name)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
s"$META_COLUMN.version")
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
s"$META_COLUMN.key")
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
s"$META_COLUMN.partition")
 .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
   "com.myCustompayloadClass")
 .option(HoodieCompactionConfig.PAYLOAD_CLASS_PROP,
   "com.myCustompayloadClass")
 .option(HoodieWriteConfig.UPSERT_PARALLELISM, 32)
 .option(HoodieWriteConfig.INSERT_PARALLELISM, 32)
 .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, 3)
 .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
12 )
 .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, true)
 .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, String.valueOf(256 
* 1024 * 1024))
 .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
String.valueOf(256 * 1024 * 1024))
 .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
   
   We're using our own payload class that decides what to keep based on a 
timestamp in the message rather than simply taking the latest. 
   
   **Stacktrace**
   
   
   StackTrace of list operation where we are spending a lot of time. 
   
   
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
   
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
   org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
   org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
   
org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
   

[GitHub] [hudi] leesf commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-07-20 Thread GitBox


leesf commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-660992766


   @pratyakshsharma Thanks for the updates and sorry for the late response. For 
users not using the latest master, they still need to use 
`NonpartitionedKeyGenerator`, so I think it is valuable to mention it. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


leesf commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457312361



##
File path: 
hudi-client/src/main/java/org/apache/hudi/exception/HoodieCommitCallbackException.java
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.exception;
+
+import org.apache.hudi.callback.HoodieWriteCommitCallback;
+
+/**
+ * Exception thrown for any higher level errors when {@link 
HoodieWriteCommitCallback} is executing a callback.
+ */
+public class HoodieCommitCallbackException extends HoodieException {
+
+  public HoodieCommitCallbackException(String msg, Throwable e) {
+super(msg, e);
+  }
+
+  public HoodieCommitCallbackException(String msg) {
+super(msg);
+  }
+

Review comment:
   extra line





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


leesf commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457311954



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +632,33 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  /**
+   * Commit call back configs.
+   */
+  public boolean enableWriteCommitCallback() {
+return 
Boolean.parseBoolean(props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_ON));
+  }
+
+  public String getCallbackType() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_TYPE_PROP);
+  }
+
+  public String getCallbackClass() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_CLASS_PROP);
+  }
+
+  public String getCallbackRestUrl() {
+return 
props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_URL_PROP);
+  }
+
+  public int getCallbackRestTimeout() {
+return 
Integer.parseInt(props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_TIMEOUT_SECONDS));
+  }
+
+  public String getCallbackRestApiKey() {
+return props.getProperty(HoodieWriteCommitCallbackConfig.CALLBACK_API_KEY);
+  }

Review comment:
   should we move these methods into `HoodieHttpWriteCommitCallback.java`? I am worried about an explosion of callback-related methods in `HoodieWriteConfig`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


leesf commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457309449



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.config;
+
+import org.apache.hudi.callback.common.HoodieCommitCallbackType;
+import org.apache.hudi.common.config.DefaultHoodieConfig;
+
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+/**
+ * Write callback related config.
+ */
+public class HoodieWriteCommitCallbackConfig extends DefaultHoodieConfig {
+
+  public static final String CALLBACK_ON = "hoodie.write.commit.callback.on";
+  public static final boolean DEFAULT_CALLBACK_ON = false;
+
+  public static final String CALLBACK_TYPE_PROP = 
"hoodie.write.commit.callback.type";
+  public static final String DEFAULT_CALLBACK_TYPE_PROP = 
HoodieCommitCallbackType.HTTP.name();
+
+  public static final String CALLBACK_CLASS_PROP = 
"hoodie.write.commit.callback.class";
+  public static final String DEFAULT_CALLBACK_CLASS_PROP = "";
+
+  // * REST callback configs *
+  public static final String CALLBACK_HTTP_URL_PROP = 
"hoodie.write.commit.callback.http.url";
+  public static final String CALLBACK_API_KEY = 
"hoodie.write.commit.callback.http.api.key";
+  public static final String DEFAULT_CALLBACK_API_KEY = 
"hudi_write_commit_callback";
+  public static final String CALLBACK_TIMEOUT_SECONDS = 
"hoodie.write.commit.callback.rest.timeout.seconds";
+  public static final String DEFAULT_CALLBACK_TIMEOUT_SECONDS = "3";
+
+  private HoodieWriteCommitCallbackConfig(Properties props) {
+super(props);
+  }
+
+  public static HoodieWriteCommitCallbackConfig.Builder newBuilder() {
+return new HoodieWriteCommitCallbackConfig.Builder();
+  }
+
+  public static class Builder {
+
+private final Properties props = new Properties();
+
+public HoodieWriteCommitCallbackConfig.Builder fromFile(File 
propertiesFile) throws IOException {
+  try (FileReader reader = new FileReader(propertiesFile)) {
+this.props.load(reader);
+return this;
+  }
+}
+
+public HoodieWriteCommitCallbackConfig.Builder fromProperties(Properties 
props) {
+  this.props.putAll(props);
+  return this;
+}
+
+public HoodieWriteCommitCallbackConfig.Builder 
writeCommitCallbackOn(String callbackOn) {
+  props.setProperty(CALLBACK_ON, String.valueOf(callbackOn));

Review comment:
   no need to use `String.valueOf` here, since `callbackOn` is already a `String`
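   
   A minimal sketch of the simplified setter (meant to live inside the `Builder` shown in the diff above; only the redundant conversion is dropped):
   
   ```java
   public HoodieWriteCommitCallbackConfig.Builder writeCommitCallbackOn(String callbackOn) {
     // callbackOn is already a String, so it can be stored directly
     props.setProperty(CALLBACK_ON, callbackOn);
     return this;
   }
   ```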





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


leesf commented on a change in pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#discussion_r457309211



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -632,6 +632,33 @@ public FileSystemViewStorageConfig 
getClientSpecifiedViewStorageConfig() {
 return clientSpecifiedViewStorageConfig;
   }
 
+  /**
+   * Commit call back configs.
+   */
+  public boolean enableWriteCommitCallback() {

Review comment:
   keep it consistent with `CALLBACK_ON` above, so `writeCommitCallbackOn`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1842: [HUDI-1037]Introduce a write committed callback hook

2020-07-20 Thread GitBox


leesf commented on pull request #1842:
URL: https://github.com/apache/hudi/pull/1842#issuecomment-660975134


   > Hi, @yanghua @leesf, I was wondering whether we should throw an exception instead of logging a warning when the callback service fails (it only logs a warning currently).
   > Since the callback is not enabled by default, if the user enables the callback service then the callback is really what they want, so if any error occurs the job should fail.
   > What do you think?
   
   reasonable
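   
   For illustration, a sketch of the failure handling agreed on here (the callback invocation and variable names are assumptions, not the final API; `HoodieCommitCallbackException` is the exception class from the diff above):
   
   ```java
   // Sketch only: the callback method is a placeholder for whatever the PR finalizes.
   static void notifyCommit(HoodieWriteCommitCallback commitCallback, String commitInstant) {
     try {
       commitCallback.call(commitInstant); // hypothetical invocation; the real signature may differ
     } catch (Exception e) {
       // the user explicitly enabled the callback, so fail the write instead of only logging a warning
       throw new HoodieCommitCallbackException("Commit callback failed for instant " + commitInstant, e);
     }
   }
   ```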



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1851: [HUDI-1113] Add user define metrics reporter

2020-07-20 Thread GitBox


leesf commented on pull request #1851:
URL: https://github.com/apache/hudi/pull/1851#issuecomment-660974770


   @zherenyu831 Thanks for your contribution! Would you please check the Travis failure?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1774: [HUDI-703]Add unit test for HoodieSyncCommand

2020-07-20 Thread GitBox


yanghua commented on a change in pull request #1774:
URL: https://github.com/apache/hudi/pull/1774#discussion_r457278729



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieSyncCommand.java
##
@@ -74,9 +74,9 @@ public String validateSync(
 }
 
 String targetLatestCommit =
-targetTimeline.getInstants().iterator().hasNext() ? "0" : 
targetTimeline.lastInstant().get().getTimestamp();
+targetTimeline.getInstants().iterator().hasNext() ? 
targetTimeline.lastInstant().get().getTimestamp() : "0";

Review comment:
   Good catch!

##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/HoodieTestHiveBase.java
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.integ;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.io.IOException;
+import java.io.InputStream;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Base class to run cmd and generate data in hive.
+ */
+public class HoodieTestHiveBase extends ITTestBase {
+
+  protected enum PartitionType {
+SINGLE_KEY_PARTITIONED, MULTI_KEYS_PARTITIONED, NON_PARTITIONED,
+  }
+
+  /**
+   * A basic integration test that runs HoodieJavaApp to create a sample 
Hoodie data-set and performs upserts on it.
+   * Hive integration and upsert functionality is checked by running a count 
query in hive console. TODO: Add
+   * spark-shell test-case
+   */
+  public void generateDataByHoodieJavaApp(String hiveTableName, String 
tableType, PartitionType partitionType,
+  String commitType, String hoodieTableName) throws Exception {
+
+String hdfsPath = getHdfsPath(hiveTableName);
+String hdfsUrl = "hdfs://namenode" + hdfsPath;
+
+Pair<String, String> stdOutErr;
+if ("overwrite".equals(commitType)) {
+  // Drop Table if it exists
+  try {
+dropHiveTables(hiveTableName, tableType);
+  } catch (AssertionError ex) {
+// In travis, sometimes, the hivemetastore is not ready even though we 
wait for the port to be up
+// Workaround to sleep for 5 secs and retry
+// Set sleep time by hoodie.hiveserver.time.wait
+Thread.sleep(getTimeWait());
+dropHiveTables(hiveTableName, tableType);
+  }
+
+  // Ensure table does not exist
+  stdOutErr = executeHiveCommand("show tables like '" + hiveTableName + 
"'");
+  assertTrue(stdOutErr.getLeft().isEmpty(), "Dropped table " + 
hiveTableName + " exists!");
+}
+
+// Run Hoodie Java App
+String cmd = String.format("%s %s --hive-sync --table-path %s  --hive-url 
%s  --table-type %s  --hive-table %s" +
+" --commit-type %s  --table-name %s", HOODIE_JAVA_APP, 
"HoodieJavaGenerateApp", hdfsUrl, HIVE_SERVER_JDBC_URL,
+tableType, hiveTableName, commitType, hoodieTableName);
+if (partitionType == PartitionType.MULTI_KEYS_PARTITIONED) {
+  cmd = cmd + " --use-multi-partition-keys";
+} else if (partitionType == PartitionType.NON_PARTITIONED){
+  cmd = cmd + " --non-partitioned";
+}
+executeCommandStringInDocker(ADHOC_1_CONTAINER, cmd, true);
+
+String snapshotTableName = getSnapshotTableName(tableType, hiveTableName);
+
+// Ensure table does exist
+stdOutErr = executeHiveCommand("show tables like '" + snapshotTableName + 
"'");
+assertEquals(snapshotTableName, stdOutErr.getLeft(), "Table exists");
+  }
+
+  protected void dropHiveTables(String hiveTableName, String tableType) throws 
Exception {
+if (tableType.equals(HoodieTableType.MERGE_ON_READ.name())) {
+  executeHiveCommand("drop table if exists " + hiveTableName + "_rt");
+  executeHiveCommand("drop table if exists " + hiveTableName + "_ro");
+} else {
+  executeHiveCommand("drop table if exists " + hiveTableName);
+}
+  }
+
+  protected String getHdfsPath(String hiveTableName) {

Review comment:
   `getHDFSPath` looks better?

##
File path: 

[GitHub] [hudi] codecov-commenter commented on pull request #1770: [HUDI-708]Add temps show and unit test for TempViewCommand

2020-07-20 Thread GitBox


codecov-commenter commented on pull request #1770:
URL: https://github.com/apache/hudi/pull/1770#issuecomment-660950311


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1770?src=pr=h1) Report
   > Merging 
[#1770](https://codecov.io/gh/apache/hudi/pull/1770?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/c7f1a781ab4ff3784d53a102364fd85e379811d1=desc)
 will **decrease** coverage by `5.84%`.
   > The diff coverage is `64.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1770/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1770?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1770      +/-   ##
   ============================================
   - Coverage     60.48%   54.63%    -5.85%
   + Complexity     3627     3022      -605
   ============================================
     Files           439      404       -35
     Lines         19007    16867     -2140
     Branches       1916     1664      -252
   ============================================
   - Hits          11496     9215     -2281
   - Misses         6725     7028      +303
   + Partials        786      624      -162
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `68.96% <64.00%> (+0.58%)` | `1447.00 <6.00> (+18.00)` | |
   | #hudiclient | `79.25% <ø> (+0.05%)` | `1257.00 <ø> (ø)` | |
   | #hudicommon | `54.74% <ø> (+0.45%)` | `1508.00 <ø> (+22.00)` | |
   | #hudihadoopmr | `?` | `?` | |
   | #hudihivesync | `?` | `?` | |
   | #hudispark | `19.83% <ø> (-27.50%)` | `19.00 <ø> (-83.00)` | |
   | #huditimelineservice | `?` | `?` | |
   | #hudiutilities | `12.07% <ø> (-62.50%)` | `48.00 <ø> (-231.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1770?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...g/apache/hudi/cli/utils/SparkTempViewProvider.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL1NwYXJrVGVtcFZpZXdQcm92aWRlci5qYXZh)
 | `59.67% <55.55%> (+59.67%)` | `12.00 <2.00> (+12.00)` | |
   | 
[.../org/apache/hudi/cli/commands/TempViewCommand.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1RlbXBWaWV3Q29tbWFuZC5qYXZh)
 | `69.23% <66.66%> (+49.23%)` | `4.00 <3.00> (+3.00)` | |
   | 
[...i/src/main/java/org/apache/hudi/cli/HoodieCLI.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZUNMSS5qYXZh)
 | `89.18% <75.00%> (+7.37%)` | `18.00 <1.00> (+3.00)` | |
   | 
[...g/apache/hudi/keygen/GlobalDeleteKeyGenerator.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9rZXlnZW4vR2xvYmFsRGVsZXRlS2V5R2VuZXJhdG9yLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-7.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/1770/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 

[jira] [Created] (HUDI-1113) Support user defined metrics reporter

2020-07-20 Thread Zheren Yu (Jira)
Zheren Yu created HUDI-1113:
---

 Summary: Support user defined metrics reporter
 Key: HUDI-1113
 URL: https://issues.apache.org/jira/browse/HUDI-1113
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Usability
Reporter: Zheren Yu
 Fix For: 0.5.3


Currently the metrics reporter only supports Datadog, JMX, and Graphite. When users want to add their own reporter it becomes difficult (our team is using New Relic), and not everyone wants those extra dependencies added to the Hudi components. So I suggest supporting a user-defined metrics reporter, which would make it easier to monitor metrics anywhere.
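
For illustration, a minimal sketch of what a pluggable, user-defined reporter could look like; the interface and method names below are assumptions for discussion, not the actual Hudi API (an implementation class would presumably be loaded by reflection from a config property):

```java
import java.util.Map;

// Assumed plug-in contract; the names are illustrative only.
interface UserDefinedMetricsReporter {
  void start();                           // open a connection to the sink (e.g. New Relic)
  void report(Map<String, Long> metrics); // push the current metrics snapshot
  void stop();                            // flush and close
}

// Toy implementation that just prints metrics, standing in for a real sink.
class LoggingMetricsReporter implements UserDefinedMetricsReporter {
  @Override public void start() { System.out.println("reporter started"); }

  @Override public void report(Map<String, Long> metrics) {
    metrics.forEach((name, value) -> System.out.println(name + " = " + value));
  }

  @Override public void stop() { System.out.println("reporter stopped"); }
}
```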



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] zherenyu831 closed pull request #1851: Add user define metrics reporter

2020-07-20 Thread GitBox


zherenyu831 closed pull request #1851:
URL: https://github.com/apache/hudi/pull/1851


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 opened a new pull request #1851: Add user define metrics reporter

2020-07-20 Thread GitBox


zherenyu831 opened a new pull request #1851:
URL: https://github.com/apache/hudi/pull/1851


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] mabin001 closed pull request #1832: [HUDI-1099]: improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-20 Thread GitBox


mabin001 closed pull request #1832:
URL: https://github.com/apache/hudi/pull/1832


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #1849: [WIP] Externalize test classes' configs

2020-07-20 Thread GitBox


xushiyan commented on a change in pull request #1849:
URL: https://github.com/apache/hudi/pull/1849#discussion_r457130487



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
##
@@ -274,6 +275,7 @@ public HoodieIndexConfig build() {
   DEFAULT_GLOBAL_SIMPLE_INDEX_PARALLELISM);
   setDefaultOnCondition(props, 
!props.containsKey(SIMPLE_INDEX_UPDATE_PARTITION_PATH),
   SIMPLE_INDEX_UPDATE_PARTITION_PATH, 
DEFAULT_SIMPLE_INDEX_UPDATE_PARTITION_PATH);
+  setDefaultOnCondition(props, !isHBaseIndexConfigSet, 
HoodieHBaseIndexConfig.newBuilder().fromProperties(props).build());

Review comment:
   @yanghua Maybe this can be done specially, as below, for the HBase index. But I'm inclined to leave it as a simple rule: any sub-config should be initialized regardless. Would like to have some input on this.
   
   ```suggestion
 setDefaultOnCondition(props, indexType==IndexType.HBASE && 
!isHBaseIndexConfigSet, 
HoodieHBaseIndexConfig.newBuilder().fromProperties(props).build());
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r457128988



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;

Review comment:
   
![image](https://user-images.githubusercontent.com/49835526/87910984-76cc5400-ca9d-11ea-8279-b9a37b249e7b.png)
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer commented on issue #1845: [SUPPORT] Support for Schema evolution. Facing an error

2020-07-20 Thread GitBox


sbernauer commented on issue #1845:
URL: https://github.com/apache/hudi/issues/1845#issuecomment-660853366


   Thanks for your fast reply!
   
   The PR adds a new test and improves two existing tests. The four new columns mentioned in TestHoodieAvroUtils increase the number of tested cases, and those tests still pass.
   The new test is in TestHoodieDeltaStreamer and starts the DeltaStreamer with different schemas and transformers to reproduce the actual problem.
   
   Regarding the two versions of my schema, I can only provide a diff, but it should be sufficient. The new field has the same type (union of null and string, default of null) as reproduced in the test here 
https://github.com/apache/hudi/pull/1844/files#diff-07dd5ed6077721a382c35b2700da0883R130.
   
   diff Event.json_schema Event-aged.json_schema 
   > },
   > {
   >   "name": "agedOptionalField",
   >   "type": [
   > "null",
   > {
   >   "type": "string",
   >   "avro.java.string": "String"
   > }
   >   ],
   >   "doc": "New aged optional field",
   >   "default": null
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-20 Thread GitBox


Mathieu1124 commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r457127004



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/HoodieEngineContext.java
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common;
+
+import org.apache.hudi.client.TaskContextSupplier;
+import org.apache.hudi.common.config.SerializableConfiguration;
+
+/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public class HoodieEngineContext {
+  /**
+   * A wrapped hadoop configuration which can be serialized.
+   */
+  private SerializableConfiguration hadoopConf;

Review comment:
   > 
   > 
   > Just bump into this... Since this is a generic engine context, will it be 
better to use a generic name like `engineConfig`?
   
   Hi Henry, thanks for your review. This class holds more than config stuff (you can see its child class `HoodieSparkEngineContext`), so maybe context is better, WDYT?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1728: Processing time gradually increases while using spark structured streaming

2020-07-20 Thread GitBox


bvaradar commented on issue #1728:
URL: https://github.com/apache/hudi/issues/1728#issuecomment-660841314


   (copied the comment from 
https://github.com/apache/hudi/issues/1830#issuecomment-660840191)
   
   We spent time over the weekend setting up a local test bed with kafka and 
structured streaming to reproduce this behavior. Here are the steps I followed 
with code : https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271
   
   I ran the setup overnight for many hours with both MOR and COW tables but 
was not able to reproduce the gradual increase in time. I did see variance in 
processing time depending upon the incoming workload because of index lookup 
and parquet writing but there was no increase in processing time.
   
   We should try to run this in S3 environment because we suspect this is seen 
in S3 environment alone. If possible, Would you be interested in taking the 
above gist and run it in your setup ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-20 Thread GitBox


bvaradar commented on issue #1830:
URL: https://github.com/apache/hudi/issues/1830#issuecomment-660840191


   We spent time over the weekend setting up a local test bed with kafka and 
structured streaming to reproduce this behavior.  Here are the steps I followed 
with code : https://gist.github.com/bvaradar/d892c6c6a69664463f8601d09c187271 
   
   I ran the setup overnight for many hours with both MOR and COW tables but 
was not able to reproduce the gradual increase in time. I did see variance in 
processing time depending upon the incoming workload because of index lookup 
and parquet writing but there was no increase in processing time. 
   
   We should try to run this in S3 environment because we suspect this is seen 
in S3 environment alone. If possible,  Would you be interested in taking the 
above gist and run it in your setup ?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-660836081


   @zuyanton : Thanks for the detailed write-up. This is very interesting. If you look at the base implementation of the FileStatus getLen() method, it returns a cached copy of the length, so I wouldn't expect it to be the cause of such high variance. Also, the 100 milliseconds you observed suggests it is definitely making some blocking operations like RPC calls. Does the EMR/S3 filesystem implementation override these classes?
   
   ```
   
 /**
  * Get the length of this file, in bytes.
  * @return the length of this file, in bytes.
  */
 public long getLen() {
   return length;
 }
   ```
   
   @zuyanton : Can you track the class type for the incoming file-status object 
?
   
   cc @umehrot2 
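   
   One minimal way to capture that, using only the standard Hadoop FileSystem API (the helper class and method names are illustrative):
   
   ```java
   import java.io.IOException;
   
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class FileStatusProbe {
     // Logs the concrete FileStatus subclass returned by listStatus, which shows whether
     // the EMR/S3 filesystem hands back an implementation that overrides getLen().
     public static void printStatusClasses(FileSystem fs, Path path) throws IOException {
       for (FileStatus status : fs.listStatus(path)) {
         System.out.println(status.getClass().getName() + " len=" + status.getLen());
       }
     }
   }
   ```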



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #1850: [HUDI-994] Move TestHoodieIndex test cases to unit tests

2020-07-20 Thread GitBox


xushiyan commented on pull request #1850:
URL: https://github.com/apache/hudi/pull/1850#issuecomment-660834381


   @yanghua @vinothchandar This is a straightforward clean-up :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-490) Add DeltaStream API example to hudi-examples

2020-07-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160933#comment-17160933
 ] 

Balaji Varadarajan commented on HUDI-490:
-

[~RocMarshal]: Thanks for your interest. I have assigned the ticket to you. 

> Add DeltaStream API example to hudi-examples
> 
>
> Key: HUDI-490
> URL: https://issues.apache.org/jira/browse/HUDI-490
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: Roc Marshal
>Priority: Major
>  Labels: starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-490) Add DeltaStream API example to hudi-examples

2020-07-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-490:
---

Assignee: Roc Marshal

> Add DeltaStream API example to hudi-examples
> 
>
> Key: HUDI-490
> URL: https://issues.apache.org/jira/browse/HUDI-490
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: Roc Marshal
>Priority: Major
>  Labels: starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-490) Add DeltaStream API example to hudi-examples

2020-07-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-490:

Status: Open  (was: New)

> Add DeltaStream API example to hudi-examples
> 
>
> Key: HUDI-490
> URL: https://issues.apache.org/jira/browse/HUDI-490
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: dengziming
>Priority: Major
>  Labels: starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-490) Add DeltaStream API example to hudi-examples

2020-07-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160929#comment-17160929
 ] 

Balaji Varadarajan commented on HUDI-490:
-

[~dengziming] : Can you add some description on what needs to be done here ? 

> Add DeltaStream API example to hudi-examples
> 
>
> Key: HUDI-490
> URL: https://issues.apache.org/jira/browse/HUDI-490
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: dengziming
>Priority: Major
>  Labels: starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-490) Add DeltaStream API example to hudi-examples

2020-07-20 Thread Roc Marshal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160926#comment-17160926
 ] 

Roc Marshal commented on HUDI-490:
--

[~vbalaji]

I'm willing to do this.

Could you assign this ticket to me if no one is working on this ?

Thank You.

> Add DeltaStream API example to hudi-examples
> 
>
> Key: HUDI-490
> URL: https://issues.apache.org/jira/browse/HUDI-490
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: dengziming
>Priority: Major
>  Labels: starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan opened a new pull request #1850: [HUDI-994] Move TestHoodieIndex test cases to unit tests

2020-07-20 Thread GitBox


xushiyan opened a new pull request #1850:
URL: https://github.com/apache/hudi/pull/1850


   Split unit test cases `testCreateIndex()`, `testCreateDummyIndex()` and `testCreateIndexWithException()` out into `TestHoodieIndexConfigs`.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org